爬虫学习day2

2020-05-16 17:52:06 阅读：299 来源： 互联网

标签：day2 urlopen Request request 爬虫 urllib 学习 headers response

一、Request对象

　　通过构建一个request，服务器响应请求得到应答，这样显得逻辑上清晰明确

from urllib.request import urlopen
from urllib.request import Request

request = Request("http://www.baidu.com")
response = urlopen(requst)
print response.read().decode()

二、GET请求

　　大部分被传输到浏览器的html，images，js，css, … 都是通过GET方法发出请求的。它是获取数据的主要方法

from urllib.request import Request,urlopen
from urllib.parse import quote

url = 'https://www.baidu.com/s?wd={}'.format(quote("哔哩哔哩"))
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"
}

requset = Request(url,headers=headers)
response = urlopen(requset)
print(response.read().decode())

　　这里没添加绕过机制被ban了，但大致就是这么个流程

三、POST请求

　　Request请求对象的里有data参数，它就是用在POST里的，我们要传送的数据就是这个参数data，data是一个字典，里面要匹配键值对，一般地，POST在我们登陆的时候使用

from urllib.request import Request,urlopen
from urllib.parse import urlencode

url = "http://192.168.43.167/DVWA/login.php"
data = {
    "username":"admin",
    "password":"password"
}
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"
}
data = urlencode(data)
request = Request(url,data=data.encode(),headers=headers)
response = urlopen(request)
print(response.read().decode())

四、Ajax的请求获取数据

　　有些网页内容使用AJAX加载，而AJAX一般返回的是JSON,直接对AJAX地址进行post或get，就返回JSON数据了

from urllib.request import Request, urlopen

base_url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start={}&limit=20"
i = 0
while True:
    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"
    }
    url = base_url.format(i * 20)
    request = Request(url, headers=headers)
    response = urlopen(request)
    info = response.read().decode()
    print(info)

    #当读到完时跳出
    if info == "" or info is None:
        break
    i += 1

五、SSL

　　 https 开头的网站，urllib可以为 HTTPS 请求验证SSL证书，就像web浏览器一样，如果网站的SSL证书是经过CA认证的，则能够正常访问，如：https://www.baidu.com/

　　如果SSL证书验证不通过，或者操作系统不信任服务器的安全证书，比如浏览器在访问12306网站如：https://www.12306.cn/mormhweb/的时候，会警告用户证书不受信任。（据说 12306 网站证书是自己做的，没有通过CA认证）

# 忽略SSL安全认证
context = ssl._create_unverified_context()
# 添加到context参数里
response = urllib.request.urlopen(request, context = context)

from urllib.request import Request, urlopen
import ssl

url = 'https://www.12306.cn/mormhweb/'
headers = {
     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"
}
request = Request(url, headers=headers)
# 忽略验证证书
context = ssl._create_unverified_context()
response = urlopen(request, context=context)
info = response.read().decode()
print(info)

标签：day2,urlopen,Request,request,爬虫,urllib,学习,headers,response
来源： https://www.cnblogs.com/Lee-404/p/12901474.html