Using HttpWebRequest to simulate a browser visiting a website
newr2006 · 2018-06-22 · via 博客园 - newr2006

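Recently, fetching a web page started failing with errors.

Either it returned: The remote server returned an error: (442)
or it returned: 非法访问,您的行为已被WAF系统记录! ("Illegal access. Your behavior has been recorded by the WAF system!")

I assumed the site had added some anti-scraping check, so I changed the method to send browser-like request headers (Request.Headers and the related properties), and that was enough.

To find out which headers to add, first capture a real browser request with Fiddler, for example: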

GET http://www.baidu.com/ HTTP/1.1
Host: www.baidu.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9

  
using System;
using System.IO;
using System.Net;
using System.Text;

public static class PageFetcher // class name is arbitrary; the original shows only the method
{
    public static string GetHtml()
    {
        string url = "http://www.baidu.com";

        // Create the request
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 300000;          // milliseconds
        request.ReadWriteTimeout = 300000; // milliseconds
        request.Method = "GET";
        request.ProtocolVersion = HttpVersion.Version11;
        request.KeepAlive = true;

        // Headers copied from the captured browser request.
        // Host, Accept, User-Agent and Referer are set through properties, not Headers.Add.
        request.Host = "www.baidu.com";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";
        request.UserAgent = @"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36";
        request.Referer = url;
        request.Headers.Add("Accept-Language", "zh-cn,en-us;q=0.5");
        // Not part of the capture; a server that honors it may answer 304 Not Modified.
        request.IfModifiedSince = DateTime.UtcNow;

        // Sends Accept-Encoding: gzip, deflate and decompresses the body automatically,
        // so there is no need to add that header or wrap the stream in a GZipStream by hand.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        // Read the response; the using blocks close everything even if reading throws.
        // The encoding depends on the site: check the page's HTML source and adjust it.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, Encoding.GetEncoding("gb2312")))
        {
            return reader.ReadToEnd();
        }
    }
}
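The mapping from a captured request to HttpWebRequest settings can also be done mechanically. The following is only a sketch under stated assumptions: the class name `CapturedHeaders` and its `Parse` and `Apply` methods are hypothetical helpers, not part of the original post or of .NET. It relies on the fact that HttpWebRequest exposes some headers (Host, User-Agent, Accept, Referer, Connection) as properties, and it skips any other restricted header rather than trying to handle every case.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helper: turns a Fiddler-style capture into HttpWebRequest settings.
public static class CapturedHeaders
{
    // Parses "Name: value" lines into a case-insensitive dictionary.
    public static Dictionary<string, string> Parse(string rawCapture)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string[] lines = rawCapture.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon <= 0) continue;
            string name = line.Substring(0, colon).Trim();
            // The request line ("GET http://... HTTP/1.1") also contains a colon; skip it.
            if (name.IndexOf(' ') >= 0) continue;
            headers[name] = line.Substring(colon + 1).Trim();
        }
        return headers;
    }

    // Applies parsed headers, routing the ones HttpWebRequest exposes as properties.
    public static void Apply(HttpWebRequest request, IDictionary<string, string> headers)
    {
        foreach (KeyValuePair<string, string> h in headers)
        {
            switch (h.Key.ToLowerInvariant())
            {
                case "host": request.Host = h.Value; break;
                case "user-agent": request.UserAgent = h.Value; break;
                case "accept": request.Accept = h.Value; break;
                case "referer": request.Referer = h.Value; break;
                case "connection":
                    request.KeepAlive = h.Value.Equals("keep-alive", StringComparison.OrdinalIgnoreCase);
                    break;
                case "accept-encoding":
                    // Let the framework send the header and decompress the body.
                    request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
                    break;
                default:
                    // Other restricted headers are simply skipped in this sketch.
                    if (!WebHeaderCollection.IsRestricted(h.Key))
                        request.Headers[h.Key] = h.Value;
                    break;
            }
        }
    }
}
```

With this in place, the capture text can be pasted into a string, passed through `CapturedHeaders.Parse`, and applied to the request returned by `WebRequest.Create` before calling `GetResponse`.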

Here is some PHP code as well: how to keep a fetched gb2312 page from turning into garbled text without changing this page's own utf-8 encoding.

<?php
// Request headers to send with the fetch
$headers = array();
$headers[] = 'X-Apple-Tz: 0';
$headers[] = 'X-Apple-Store-Front: 143444,12';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$headers[] = 'Accept-Encoding: gzip, deflate';
$headers[] = 'Accept-Language: en-US,en;q=0.5';
$headers[] = 'Cache-Control: no-cache';
$headers[] = 'Content-Type: application/x-www-form-urlencoded; charset=gb2312'; // or utf-8
$headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36';

$url = 'http://www.baidu.com/'; // placeholder target; the original leaves $url undefined
$post_fields = null;            // placeholder; with no POST fields a plain GET is sent

$dat = cUrlGetData($url, $post_fields, $headers);
// The conversion step itself is not shown in the original. Assumption: a call such as
// mb_convert_encoding($dat, 'UTF-8', 'GBK') would turn the fetched gb2312 text into
// utf-8 (GBK is a superset of gb2312).

/* Fetch a URL (POST when $post_fields is set, otherwise GET) and return the body */
function cUrlGetData($url, $post_fields = null, $headers = null)
{
    $ch = curl_init();
    $timeout = 50000; // passed to CURLOPT_CONNECTTIMEOUT, which is in seconds
    curl_setopt($ch, CURLOPT_URL, $url);
    if (!empty($post_fields)) {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
    }
    if (!empty($headers)) {
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0); // skip the certificate host-name check
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0); // skip certificate chain verification
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate'); // decode gzip/deflate response bodies
    $data = curl_exec($ch);
    if (curl_errno($ch)) {
        echo 'Error:' . curl_error($ch);
    }
    curl_close($ch);
    return $data;
}

/* POST data to a remote URL and return the response body */
function ppost($url, $data, $ref)
{
    $curl = curl_init();                              // start a cURL session
    curl_setopt($curl, CURLOPT_URL, $url);            // address to request
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);    // skip certificate chain verification
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2);    // check the host name in the certificate (value 1 is no longer supported by libcurl)
    curl_setopt($curl, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']); // reuse the visitor's browser User-Agent (set only under a web server)
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);    // follow redirects automatically
    curl_setopt($curl, CURLOPT_REFERER, $ref);
    curl_setopt($curl, CURLOPT_POST, 1);              // send a regular POST request
    curl_setopt($curl, CURLOPT_POSTFIELDS, $data);    // POST body
    curl_setopt($curl, CURLOPT_COOKIEFILE, $GLOBALS['cookie_file']); // read previously stored cookies ($cookie_file must be defined globally)
    curl_setopt($curl, CURLOPT_COOKIEJAR, $GLOBALS['cookie_file']);  // file in which cookies are saved
    curl_setopt($curl, CURLOPT_HTTPHEADER, array('Accept-Encoding: gzip, deflate'));
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate'); // decode gzip/deflate response bodies
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);          // overall timeout in seconds, so the call cannot hang forever
    curl_setopt($curl, CURLOPT_HEADER, 0);            // do not include response headers in the output
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);    // return the response as a string
    $tmpInfo = curl_exec($curl);                      // perform the request
    if (curl_errno($curl)) {
        echo 'Errno' . curl_error($curl);
    }
    curl_close($curl);                                // close the cURL session
    return $tmpInfo;                                  // return the data
}
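The same garbling concern applies to the C# sample, which hard-codes gb2312 when reading the response. As a rough sketch only, the encoding could instead be resolved from a charset name, for example the value of `HttpWebResponse.CharacterSet`, with a fallback when the name is missing or unknown. The class `CharsetHelper` and its methods are hypothetical names introduced here for illustration. Two caveats worth checking in your environment: `CharacterSet` may report a default such as ISO-8859-1 when the server sends no charset, and on .NET Core the gb2312 encoding is only available after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`.

```csharp
using System;
using System.Text;

// Hypothetical helper: pick a text encoding by name and re-encode bytes as UTF-8.
public static class CharsetHelper
{
    // Returns the encoding named by charset, or fallback when the name is empty or unknown.
    public static Encoding Resolve(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset)) return fallback;
        try
        {
            // Some servers quote the charset value; strip spaces and quotes first.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return fallback;
        }
    }

    // Re-encodes raw bytes from the source encoding to UTF-8.
    public static byte[] ToUtf8(byte[] raw, Encoding source)
    {
        return Encoding.Convert(source, Encoding.UTF8, raw);
    }
}
```

In the C# sample, this would mean passing something like `CharsetHelper.Resolve(response.CharacterSet, Encoding.GetEncoding("gb2312"))` to the `StreamReader` constructor instead of the fixed encoding.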