Fetching a page's URL, status code, and related information with Python
- # For sites with anti-scraping measures, you can set headers so the request looks like it comes from a browser
- import urllib.request
- url="https://www.baidu.com/"
- file=urllib.request.urlopen(url)
- print('Current URL:',file.geturl())
- print('HTTP status code:',file.getcode())
- print('file.info() returns the response headers:',file.info())
Output:
- D:\工具\pythonTools\CatchTest1101\venv\Scripts\python.exe D:/工具/pythonTools/CatchTest1101/venv/test/test110204.py
- Current URL: https://www.baidu.com/
- HTTP status code: 200
- file.info() returns the response headers: Accept-Ranges: bytes
- Cache-Control: no-cache
- Content-Length: 227
- Content-Type: text/html
- Date: Fri, 02 Nov 2018 03:01:23 GMT
- Etag: "5bd7d86c-e3"
- Last-Modified: Tue, 30 Oct 2018 04:05:00 GMT
- P3p: CP=" OTI DSP COR IVA OUR IND COM "
- Pragma: no-cache
- Server: BWS/1.1
- Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
- Set-Cookie: BIDUPSID=ED1C1F4FD9C3CBA5268DC8CB64DEEA6C; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
- Set-Cookie: PSTM=1541127683; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
- Strict-Transport-Security: max-age=0
- X-Ua-Compatible: IE=Edge,chrome=1
- Connection: close
- Process finished with exit code 0
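Note that getcode() must be called with parentheses; without them Python prints the bound method itself, e.g. <bound method HTTPResponse.getcode of ...>. As a minimal sketch (the URL and timeout are illustrative), the same information can also be read defensively, since urlopen() raises urllib.error.HTTPError or URLError on failure:
- import urllib.request
- import urllib.error
- url = "https://www.baidu.com/"
- try:
-     with urllib.request.urlopen(url, timeout=10) as resp:
-         print('Status:', resp.getcode())          # an int such as 200
-         for name, value in resp.info().items():   # header (name, value) pairs
-             print(name + ':', value)
- except urllib.error.HTTPError as e:
-     print('HTTP error:', e.code)
- except urllib.error.URLError as e:
-     print('Connection failed:', e.reason)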
In your browser: enter the URL, press F12, open the Network tab, select the request in the left panel, and inspect the Headers section.
- # For sites with anti-scraping measures, you can set headers so the request looks like it comes from a browser
- import urllib.request
- url="https://gsh.cdsy.xyz"
- file=urllib.request.urlopen(url)
- print('Current URL:',file.geturl())
- print('HTTP status code:',file.getcode())
- print('file.info() returns the response headers:',file.info())
-
-
- # How to configure the crawler to access pages as if from a browser
- #User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
- # The header above holds the User-Agent string, given as a ("User-Agent", value) pair; capture it once via F12, no need to look it up every time
- #1. Use build_opener() to modify the request headers
- headers = ("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")
- opener = urllib.request.build_opener()
- opener.addheaders = [headers]
- data=opener.open(url).read()
- fhandle=open('D:/爬虫/抓取文件/2018110204.html','wb')
- fhandle.write(data)
- fhandle.close()
-
Open the saved file and check it: the content was fetched successfully.
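If every request in a script should carry the same header, the opener can also be installed globally with urllib.request.install_opener(), after which plain urlopen() calls go through it. A minimal sketch under the same assumptions (URL and User-Agent as above):
- import urllib.request
- ua = ("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")
- opener = urllib.request.build_opener()
- opener.addheaders = [ua]
- urllib.request.install_opener(opener)  # later urlopen() calls use this opener
- data = urllib.request.urlopen("https://gsh.cdsy.xyz").read()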
- Mimicking a browser with add_header() on urllib.request.Request()
- #request_object.add_header(field_name, field_value)
- #Method 2: add the header with add_header()
- import urllib.request
- url="https://gsh.cdsy.xyz"
- req=urllib.request.Request(url)
- #Mimic a browser
- req.add_header("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")
- data=urllib.request.urlopen(req).read()
#With the header set, the request successfully mimics a browser and fetches the page. Sending a User-Agent avoids 403 errors.
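Request() also accepts a headers dict directly, which avoids the separate add_header() call; a minimal equivalent sketch:
- import urllib.request
- url = "https://gsh.cdsy.xyz"
- headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
- req = urllib.request.Request(url, headers=headers)
- data = urllib.request.urlopen(req).read()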