Fetching a page's URL, status code, and related information with Python
- # For sites with anti-scraping measures, you can set headers so the request looks like it comes from a browser
- import urllib.request
- url="https://www.baidu.com/"
- file=urllib.request.urlopen(url)
- print('Current URL:',file.geturl())
- print('HTTP status code:',file.getcode())
- print('file.info() returns the response headers:',file.info())
Output:
- D:\工具\pythonTools\CatchTest1101\venv\Scripts\python.exe D:/工具/pythonTools/CatchTest1101/venv/test/test110204.py
- Current URL: https://www.baidu.com/
- HTTP status code: 200
- file.info() returns the response headers: Accept-Ranges: bytes
- Cache-Control: no-cache
- Content-Length: 227
- Content-Type: text/html
- Date: Fri, 02 Nov 2018 03:01:23 GMT
- Etag: "5bd7d86c-e3"
- Last-Modified: Tue, 30 Oct 2018 04:05:00 GMT
- P3p: CP=" OTI DSP COR IVA OUR IND COM "
- Pragma: no-cache
- Server: BWS/1.1
- Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
- Set-Cookie: BIDUPSID=ED1C1F4FD9C3CBA5268DC8CB64DEEA6C; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
- Set-Cookie: PSTM=1541127683; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
- Strict-Transport-Security: max-age=0
- X-Ua-Compatible: IE=Edge,chrome=1
- Connection: close
- Process finished with exit code 0
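Note that getcode() must be called with parentheses; without them Python prints the bound method itself, e.g. <bound method HTTPResponse.getcode of ...>. As a minimal sketch (the URL and timeout are illustrative), the same information can also be read defensively, since urlopen() raises urllib.error.HTTPError or URLError on failure:
- import urllib.request
- import urllib.error
- url = "https://www.baidu.com/"
- try:
-     with urllib.request.urlopen(url, timeout=10) as resp:
-         print('Status:', resp.getcode())          # an int such as 200
-         for name, value in resp.info().items():   # header (name, value) pairs
-             print(name + ':', value)
- except urllib.error.HTTPError as e:
-     print('HTTP error:', e.code)
- except urllib.error.URLError as e:
-     print('Connection failed:', e.reason)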
In your browser: enter the URL, press F12, open the Network tab, select the request in the left panel, and inspect the Headers section.
- # For sites with anti-scraping measures, you can set headers so the request looks like it comes from a browser
- import urllib.request
- url="https://gsh.cdsy.xyz"
- file=urllib.request.urlopen(url)
- print('Current URL:',file.geturl())
- print('HTTP status code:',file.getcode())
- print('file.info() returns the response headers:',file.info())
-
-
- # How to configure the crawler to access pages as if from a browser
- #User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
- # The header above holds the User-Agent string, given as a ("User-Agent", value) pair; capture it once via F12, no need to look it up every time
- #1. Use build_opener() to modify the request headers
- headers = ("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")
- opener = urllib.request.build_opener()
- opener.addheaders = [headers]
- data=opener.open(url).read()
- fhandle=open('D:/爬虫/抓取文件/2018110204.html','wb')
- fhandle.write(data)
- fhandle.close()
-
Open the saved file and check it: the content was fetched successfully.
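If every request in a script should carry the same header, the opener can also be installed globally with urllib.request.install_opener(), after which plain urlopen() calls go through it. A minimal sketch under the same assumptions (URL and User-Agent as above):
- import urllib.request
- ua = ("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")
- opener = urllib.request.build_opener()
- opener.addheaders = [ua]
- urllib.request.install_opener(opener)  # later urlopen() calls use this opener
- data = urllib.request.urlopen("https://gsh.cdsy.xyz").read()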
- Mimicking a browser with add_header() on urllib.request.Request()
- #request_object.add_header(field_name, field_value)
- #Method 2: add the header with add_header()
- import urllib.request
- url="https://gsh.cdsy.xyz"
- req=urllib.request.Request(url)
- #Mimic a browser
- req.add_header("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")
- data=urllib.request.urlopen(req).read()
#With the header set, the request successfully mimics a browser and fetches the page. Sending a User-Agent avoids 403 errors.
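Request() also accepts a headers dict directly, which avoids the separate add_header() call; a minimal equivalent sketch:
- import urllib.request
- url = "https://gsh.cdsy.xyz"
- headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
- req = urllib.request.Request(url, headers=headers)
- data = urllib.request.urlopen(req).read()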