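HTTP Clients

1. Preface

In everyday use, the most common HTTP client is the browser. It fetches the HTML at a given URL together with the associated JS, CSS, and image files, and renders the web page for the user.

In Python, an HTTP client is mainly used to obtain the response of a specific URL (for example, a CGI request) or to crawl pages; it generally does not execute any rendering logic.

The Python standard library provides HTTP-related modules such as urllib and httplib (renamed http.client in Python 3). There are also well-known third-party libraries such as requests.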

2. urllib2

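urllib2 is the module most commonly used in Python 2; in Python 3 its functionality moved to urllib.request (with the exception classes in urllib.error). The following import keeps code compatible with both: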

try:
    import urllib2
except ImportError:
    import urllib.request as urllib2

2.1. Example: reading the Baidu home page

def simple_request():
    url = 'http://www.baidu.com/'
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    print('http code: %d' % response.code)
    body = response.read()
    print('http body: %s' % body)

A packet capture of this request shows:

GET / HTTP/1.1
Accept-Encoding: identity
Host: www.baidu.com
User-Agent: Python-urllib/3.7
Connection: close

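As the capture shows, this example sends no custom HTTP headers. In real requests we usually add some, so that the request looks more like one sent by a browser.

2.2. Adding HTTP headers

Setting `User-Agent`

The default User-Agent makes it obvious that the request does not come from a browser, so we can replace it with a value that a browser uses.

Ways to obtain a browser's User-Agent:

- Visit an HTTP page in the browser and inspect the packet capture.
- Run navigator.userAgent in the browser's console (for example, the Console tab of Chrome DevTools).

Add the following line: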

request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')

Setting `Referer`

Some pages check the Referer header as a defense against CSRF attacks. In that case, adding a Referer header gets the request accepted.

request.add_header('Referer', 'http://www.baidu.com/')

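Setting `Cookie`

Many sites require a login before an operation can be performed. In that case, the cookie that carries the login state has to be sent in the Cookie header. Such cookies usually expire, so a more general solution is to implement the login flow in the script and save the resulting cookies.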

request.add_header('Cookie', 'BAIDUID=24572D33CC6DF235F7455D80132B39F1:FG=1;')

The complete code after these changes:

# Request the Baidu home page
def simple_request():
    url = 'http://www.baidu.com/'
    request = urllib2.Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
    request.add_header('Referer', 'http://www.baidu.com/')
    request.add_header('Cookie', 'BAIDUID=24572D33CC6DF235F7455D80132B39F1:FG=1;')
    response = urllib2.urlopen(request)
    print('http code: %d' % response.code)
    body = response.read()
    print('http body: %s' % body)

The packet capture after running it:

GET / HTTP/1.1
Accept-Encoding: identity
Host: www.baidu.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
Referer: http://www.baidu.com/
Cookie: BAIDUID=24572D33CC6DF235F7455D80132B39F1:FG=1;
Connection: close

As the capture shows, the HTTP headers were changed successfully.
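The same headers can also be passed in one step through the headers argument of Request instead of repeated add_header calls. The sketch below is a minimal illustration that assumes Python 3's urllib.request; the helper name build_browser_request is made up for this example, and the request is only constructed, never sent.

```python
import urllib.request

# Header values copied from the example above.
BROWSER_HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'),
    'Referer': 'http://www.baidu.com/',
}


def build_browser_request(url, extra_headers=None):
    """Return a Request carrying browser-like headers (hypothetical helper)."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)


request = build_browser_request('http://www.baidu.com/')
# Internally urllib stores header names in capitalized form, e.g. 'User-agent'.
for name, value in request.header_items():
    print('%s: %s' % (name, value))
```

Passing the result to urlopen works exactly as in the function above; extra_headers can carry a per-request Cookie value.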

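2.3. Sending a POST request

urlopen is declared as follows (abridged; newer Python versions accept additional SSL-related parameters):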

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT):

The second parameter specifies the data to send. When it is not None, urllib2 automatically sends a POST request instead of a GET.
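This rule can be checked without sending anything, because a Request object reports the method it will use through get_method(). A small sketch, assuming Python 3's urllib.request and a placeholder URL:

```python
import urllib.request

url = 'http://www.example.com/api'  # placeholder URL for illustration

# No data attached: the request will be sent as GET.
request_without_data = urllib.request.Request(url)
print(request_without_data.get_method())  # GET

# Data attached (must be bytes in Python 3): the request will be sent as POST.
request_with_data = urllib.request.Request(url, data=b'key=value')
print(request_with_data.get_method())  # POST
```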

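The following example uses the Baidu short-URL API (see its API documentation). The documentation requires a POST request with the Content-Type header set to application/json and a Token header for authentication; the body is a JSON string containing a url field.

import json

# Send a POST request to create a Baidu short URL
def post_request():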
    url = 'https://dwz.cn/admin/v2/create'
    request = urllib2.Request(url)
    request.add_header('Content-Type', 'application/json')
    request.add_header('Token', '5170738b0ccd9e9b6a31130116ab9ffb')
    data = {
        'url': 'http://www.baidu.com/'
    }
    response = urllib2.urlopen(request, json.dumps(data).encode('utf8'))
    print('http code: %d' % response.code)
    body = response.read()
    print('http body: %s' % body)

The response is:

http code: 200
http body: b'{"Code":0,"IsNew":true,"ShortUrl":"https://dwz.cn/E7Y29crR","LongUrl":"http://www.baidu.com/","ErrMsg":""}'

As shown, the request succeeded.

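2.4. Specifying the HTTP request method

RESTful-style APIs also use request methods such as PUT and DELETE. To send such a request, override get_method on the Request object: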

request = urllib2.Request(url)
request.get_method = lambda: 'PUT'
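On Python 3.3 and later, Request also accepts a method argument, which avoids patching get_method. This form does not exist in Python 2's urllib2, so the sketch below assumes Python 3 only; the URL and payload are placeholders.

```python
import json
import urllib.request

url = 'http://www.example.com/api/items/1'  # placeholder URL
payload = json.dumps({'name': 'demo'}).encode('utf8')

# PUT with a JSON body.
put_request = urllib.request.Request(
    url,
    data=payload,
    headers={'Content-Type': 'application/json'},
    method='PUT',
)
# DELETE without a body.
delete_request = urllib.request.Request(url, method='DELETE')

print(put_request.get_method())     # PUT
print(delete_request.get_method())  # DELETE
```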

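2.5. Disabling redirects

By default, when a request receives a 301 or 302 status code, urllib2 automatically follows the redirect to the new URL. Sometimes, however, we do not want the request to be redirected automatically.

In that case, the following approach can be used: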

class RedirectHandler(urllib2.HTTPRedirectHandler):
    '''disable redirect
    '''
    def http_error_301(self, req, fp, code, msg, headers):
        pass

    def http_error_302(self, req, fp, code, msg, headers):
        pass

opener = urllib2.build_opener(RedirectHandler)
urllib2.install_opener(opener)

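With these overrides in place, a 301 or 302 response is no longer followed; urlopen raises HTTPError instead, and the status code and Location header can be read from the exception. The drawback of this approach is that install_opener changes global configuration, which may cause conflicts in multithreaded code.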
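One way to avoid the global side effect is to keep the opener local and call its open method directly instead of installing it. The sketch below assumes Python 3; NoRedirectHandler, fetch_without_redirect, and the small local test server are names invented for this illustration. It overrides redirect_request, the hook that HTTPRedirectHandler consults before following a redirect, so the 3xx response surfaces as an HTTPError.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Refuse to follow any redirect (illustrative name)."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no follow-up request is built, so urllib
        # falls back to raising HTTPError for the 3xx response.
        return None


class AlwaysRedirect(BaseHTTPRequestHandler):
    """Local test server for the demo: answers every GET with a 302."""

    def do_GET(self):
        self.send_response(302)
        self.send_header('Location', '/elsewhere')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_without_redirect(url):
    """Return (status code, Location header) without following redirects."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore proxy environment variables
        NoRedirectHandler,
    )
    try:
        with opener.open(url, timeout=5) as response:
            return response.code, response.headers.get('Location')
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get('Location')


def demo():
    server = HTTPServer(('127.0.0.1', 0), AlwaysRedirect)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_without_redirect(
            'http://127.0.0.1:%d/' % server.server_port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == '__main__':
    print(demo())
```

Because the opener is created inside the function, other threads that call urlopen keep the default redirect behavior.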

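2.6. Capturing HTTP requests with Fiddler

When Fiddler's proxy is enabled, requests sent through urllib2 are captured and displayed. This has one problem: the browser is sending all kinds of requests at the same time, so Fiddler shows a very large number of requests and the one we care about is hard to spot.

Here is another way to capture the requests: start Fiddler without enabling its proxy, then use the following code.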

debug_proxy_addr = 'http://127.0.0.1:8888'
debug_proxy = urllib2.ProxyHandler({"http" : debug_proxy_addr, "https" : debug_proxy_addr})
if debug:
    opener = urllib2.build_opener(debug_proxy)
    urllib2.install_opener(opener)

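debug indicates whether the program is currently in debugging mode. With debug set to True, outgoing requests are captured by Fiddler automatically; with debug set to False, they are not.

This code is really meant for configuring an HTTP(S) proxy. It takes advantage of the fact that Fiddler is itself a proxy server to get packet capture.

urllib2.build_opener accepts multiple handlers, so a program can pass a different list of handlers depending on the situation.

Note: to capture HTTPS requests, SSL certificate errors must be ignored.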

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
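The debugging proxy and the relaxed certificate check can also be scoped to a single opener instead of being applied globally. The following sketch assumes Python 3; build_debug_opener and make_unverified_context are hypothetical helper names, and the proxy address is the one used above. It builds the context through the public ssl API rather than the private _create_unverified_context hook.

```python
import ssl
import urllib.request

DEBUG_PROXY_ADDR = 'http://127.0.0.1:8888'  # same address as in the code above


def make_unverified_context():
    """Return an SSL context that skips certificate checks (debugging only)."""
    context = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be relaxed.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def build_debug_opener(debug):
    """Return an opener that routes through the proxy only when debug is True."""
    handlers = []
    if debug:
        handlers.append(urllib.request.ProxyHandler({
            'http': DEBUG_PROXY_ADDR,
            'https': DEBUG_PROXY_ADDR,
        }))
        handlers.append(urllib.request.HTTPSHandler(
            context=make_unverified_context()))
    return urllib.request.build_opener(*handlers)


opener = build_debug_opener(debug=True)
# opener.open(url) would go through the proxy; plain urlopen() is unaffected.
```

Since nothing is installed globally, only code that explicitly uses this opener is affected by the relaxed certificate check.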

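2.7. Handling cookies

Cookies are generally used to track sessions. Even when no login is required, many sites set cookies and may even use them to validate that a request is legitimate.

Typically, on the first visit the server sees that no cookie was sent, generates one, and returns it in the HTTP response; the client should then include these cookies in every subsequent request.

In urllib2, cookies are usually managed with CookieJar and HTTPCookieProcessor.

CookieJar lives in the cookielib module in Python 2 and was moved to http.cookiejar in Python 3. It is responsible for parsing and storing cookies and has several subclasses, of which LWPCookieJar is a commonly used one.

The following import keeps code compatible with both: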

try:
    import cookielib as cookiejar
except ImportError:
    import http.cookiejar as cookiejar

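HTTPCookieProcessor is the cookie handler; it is initialized with a CookieJar instance.

Example usage:

cookie_file = 'cookies.txt'  # example path (an assumption); any writable file works
cookie = cookiejar.LWPCookieJar(cookie_file)
cookie_handler = urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(cookie_handler)
urllib2.install_opener(opener)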
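The cookie round trip described above, together with saving the jar to disk, can be tried against a throwaway local server. This is only a sketch under assumptions: it targets Python 3, CookieDemoHandler and make_opener are names invented for the demo, and the cookie value is arbitrary. A session cookie has no expiry time, so ignore_discard=True is needed for save and load to keep it.

```python
import http.cookiejar
import os
import tempfile
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieDemoHandler(BaseHTTPRequestHandler):
    """Local test server: issues a cookie, then echoes back what it receives."""

    def do_GET(self):
        received = self.headers.get('Cookie', '')
        body = received.encode('ascii')
        self.send_response(200)
        if not received:
            # First visit: no cookie was sent, so generate one.
            self.send_header('Set-Cookie', 'sid=abc123; Path=/')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def make_opener(jar):
    """Build a local opener whose cookies are managed by the given jar."""
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore proxy environment variables
        urllib.request.HTTPCookieProcessor(jar))


def demo():
    server = HTTPServer(('127.0.0.1', 0), CookieDemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = 'http://127.0.0.1:%d/' % server.server_port
    try:
        with tempfile.TemporaryDirectory() as tmp:
            cookie_file = os.path.join(tmp, 'cookies.txt')

            # First request: the server sets the cookie, the jar stores it.
            jar = http.cookiejar.LWPCookieJar(cookie_file)
            with make_opener(jar).open(url, timeout=5) as response:
                first = response.read()
            jar.save(ignore_discard=True)

            # Later run: reload the jar from disk and send the cookie back.
            restored = http.cookiejar.LWPCookieJar(cookie_file)
            restored.load(ignore_discard=True)
            with make_opener(restored).open(url, timeout=5) as response:
                second = response.read()
        return first, second
    finally:
        server.shutdown()
        server.server_close()


if __name__ == '__main__':
    print(demo())
```

The first body is empty because no cookie was sent yet; the second contains the cookie that was restored from the file.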

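2.8. Parsing the HTTP response

The introduction to the HTTP protocol described how to compute the length of the body. urllib2 already implements that part for us, though: calling response.read() is enough to obtain the body data.

The body may still be content-encoded at this point (as indicated by the Content-Encoding response header), in which case it needs to be decoded.

Decoding gzip

The following code decodes it: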

import gzip
# gzip.decompress (Python 3.2+) expects the raw body bytes, not the response object
body = gzip.decompress(response.read())

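Decoding deflate

import zlib

def decode_deflate(data):
    try:
        # Raw deflate stream without the zlib header
        return zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        # zlib-wrapped deflate stream
        return zlib.decompress(data)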
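The two decoders can be combined into one helper that dispatches on the Content-Encoding response header. The sketch below is illustrative rather than definitive: decode_body is a made-up name, it assumes Python 3, and it handles only a single gzip or deflate encoding.

```python
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Decode a response body according to its Content-Encoding value."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.decompress(raw)
    if encoding == 'deflate':
        try:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
        except zlib.error:
            # Others send the zlib-wrapped form.
            return zlib.decompress(raw)
    # No encoding, 'identity', or something this sketch does not handle.
    return raw


# Local round trip, no network needed.
sample = b'hello world'
print(decode_body(gzip.compress(sample), 'gzip'))
print(decode_body(zlib.compress(sample), 'deflate'))
```

With a real response the call would look like decode_body(response.read(), response.headers.get('Content-Encoding')).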

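3. requests

requests is currently the most frequently used HTTP client library. It is simple to use and easy to pick up. The official documentation is at http://2.python-requests.org/zh_CN/latest/user/quickstart.html.

3.1. Example: reading the Baidu home page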

import requests

rsp = requests.get('http://www.baidu.com/')
rsp.raise_for_status()
print('http code: %d' % rsp.status_code)
body = rsp.content
print('http body: %s' % body)

Compared with urllib2, requests is more concise to use.

3.2. Adding HTTP headers

# Request the Baidu home page
def simple_request():
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'referer': 'http://www.baidu.com/'
    }
    rsp = requests.get('http://www.baidu.com/', headers=headers)
    print('http code: %d' % rsp.status_code)
    body = rsp.content
    print('http body: %s' % body)

3.3. Sending a POST request

# Send a POST request to create a Baidu short URL
def post_request():
    url = 'https://dwz.cn/admin/v2/create'
    headers = {
        'Content-Type': 'application/json',
        'Token': '5170738b0ccd9e9b6a31130116ab9ffb'
    }
    rsp = requests.post(url, headers=headers, json={  
        'url': 'http://www.baidu.com/'
    })
    print('http code: %d' % rsp.status_code)
    body = rsp.content
    print('http body: %s' % body)

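For an ordinary POST request, the data parameter is usually set to a dict, which requests URL-encodes under the hood; to send JSON data, pass it through the json parameter instead.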
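The difference between the two parameters is easy to see by preparing the requests without sending them. This is a small sketch, assuming the requests library is installed; the endpoint is a placeholder.

```python
import json

import requests

url = 'http://httpbin.org/post'  # placeholder endpoint, never contacted here
fields = {'url': 'http://www.baidu.com/'}

# data=dict: the body is URL-encoded like an HTML form.
form = requests.Request('POST', url, data=fields).prepare()
print(form.headers['Content-Type'])
print(form.body)

# json=dict: the body is serialized as JSON and Content-Type is set for us.
as_json = requests.Request('POST', url, json=fields).prepare()
print(as_json.headers['Content-Type'])
print(json.loads(as_json.body))
```

This also means the explicit Content-Type header in the example above is optional when the json parameter is used.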

3.4. Specifying the HTTP request method

Each request method has a corresponding function, such as put and delete.
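The helpers requests.put and requests.delete are shorthand for requests.request(method, url, ...), so any method name can be supplied. The sketch below assumes requests is installed and uses a placeholder endpoint; it prepares the requests without sending them so that the chosen method can be inspected offline. prepare_with_method is a name made up for this example.

```python
import requests

url = 'http://httpbin.org/anything'  # placeholder endpoint, never contacted here


def prepare_with_method(method, payload=None):
    """Build (but do not send) a request that uses the given HTTP method."""
    return requests.Request(method, url, json=payload).prepare()


put_request = prepare_with_method('PUT', {'name': 'demo'})
delete_request = prepare_with_method('delete')  # method names are upper-cased

print(put_request.method, put_request.url)
print(delete_request.method, delete_request.url)
```

To actually send them, the usual calls are requests.put(url, json=...) and requests.delete(url), or requests.Session().send(put_request) for a prepared request.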

3.5. Disabling redirects

Pass allow_redirects=False to the request call.
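The effect of the parameter can be compared side by side against a throwaway local server. This sketch assumes requests is installed; RedirectDemoHandler and compare_redirect_modes are names invented for the demo.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectDemoHandler(BaseHTTPRequestHandler):
    """Local test server: /start redirects to /final, which returns a body."""

    def do_GET(self):
        if self.path == '/start':
            self.send_response(302)
            self.send_header('Location', '/final')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'arrived'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def compare_redirect_modes():
    server = HTTPServer(('127.0.0.1', 0), RedirectDemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = 'http://127.0.0.1:%d/start' % server.server_port
    session = requests.Session()
    session.trust_env = False  # ignore proxy environment variables in the demo
    try:
        # Default: the 302 is followed and recorded in .history.
        followed = session.get(url, timeout=5)
        # allow_redirects=False: the 302 itself is returned.
        raw = session.get(url, allow_redirects=False, timeout=5)
        return (followed.status_code,
                [r.status_code for r in followed.history],
                raw.status_code,
                raw.headers['Location'])
    finally:
        session.close()
        server.shutdown()
        server.server_close()


if __name__ == '__main__':
    print(compare_redirect_modes())
```

Unlike the urllib2 approach, no exception is raised: the 302 response comes back as a normal Response object.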

3.6. Configuring a proxy

proxies = {
    'http': 'http://localhost:8888',
    'https': 'http://localhost:8888'
}
requests.get('http://www.baidu.com/', proxies=proxies)

3.7. Handling cookies

def cookie_request():
    jar = requests.cookies.RequestsCookieJar()
    jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
    jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
    url = 'http://httpbin.org/cookies'
    r = requests.get(url, cookies=jar)
    print(r.text)

The output is:

{"cookies": {"tasty_cookie": "yum"}}

3.8. Using a Session

We usually access the same site with the same settings, such as HTTP headers and cookies. A Session object can be created to hold them.

def session_request():
    session = requests.Session()
    session.headers.update({'x-test': 'true'})

    rsp = session.get('http://httpbin.org/headers')
    print(rsp.text)

The response is:

{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Cache-Control": "max-age=259200",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Test": "true"
  }
}

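4. tornado

tornado is an efficient asynchronous networking library, and it naturally includes an HTTP client as well.

4.1. Example: reading the Baidu home page

import tornado.httpclient

def simple_request():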
    cli = tornado.httpclient.HTTPClient()
    response = cli.fetch("http://www.baidu.com/")
    print(response.body)
    cli.close()

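The code above is synchronous; in practice, tornado code more often uses the asynchronous style.

import asyncio
import tornado.httpclient

async def simple_async_request():
    cli = tornado.httpclient.AsyncHTTPClient()
    response = await cli.fetch("http://www.oa.com")
    print(response.body)
    cli.close()

asyncio.run(simple_async_request())  # run the coroutine to completion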
