A Brief Look at Scrapy Cookies

First, to put everyone's mind at ease: Scrapy manages cookies automatically, just like a browser does. As the Scrapy FAQ puts it:

Does Scrapy manage cookies automatically?

Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, like any regular web browser does.

Cookies are managed by CookiesMiddleware, one of Scrapy's downloader middlewares. Every request and every response passes through it.
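Since the middleware sits in the downloader stack, it is switched on and off through the project settings. A minimal settings.py sketch, assuming the standard COOKIES_ENABLED and COOKIES_DEBUG settings (the values here are only illustrative):

```python
# settings.py (illustrative values)

# Cookie handling is on by default; setting this to False disables CookiesMiddleware.
COOKIES_ENABLED = True

# Log the Cookie / Set-Cookie headers the middleware sends and receives.
# This value ends up as the `debug` argument of CookiesMiddleware.
COOKIES_DEBUG = True
```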

Let's first look at the part that handles requests.

Code:

from collections import defaultdict

from scrapy.http.cookies import CookieJar


class CookiesMiddleware(object):
    """This middleware enables working with sites that need cookies"""

    def __init__(self, debug=False):
        # A defaultdict lazily creates a separate CookieJar for each key
        self.jars = defaultdict(CookieJar)
        self.debug = debug

    def process_request(self, request, spider):
        if request.meta.get('dont_merge_cookies', False):
            return
        # The key that selects the cookie jar is stored in request.meta
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        cookies = self._get_request_cookies(jar, request)
        # Store the cookies set on the request in the cookie jar
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # set Cookie header
        # Remove any existing Cookie header
        request.headers.pop('Cookie', None)
        # Add the cookies from the cookie jar to the request headers
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)

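The flow is as follows:

  • A dictionary holds multiple cookie jars, one per key
  • The cookie jar designated by each request is looked up
  • The cookies set on the request are added to that cookie jar, subject to the jar's policy
  • Finally, the cookies in the jar that apply to the request are written to the request headers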
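The steps above can be tried out without Scrapy. Scrapy's CookieJar wraps the standard library's http.cookiejar.CookieJar, so the sketch below uses the stdlib class directly as a stand-in. The helpers make_cookie and attach_cookies, the jar keys, and the URL are hypothetical names made up for this illustration; they are not part of Scrapy.

```python
from collections import defaultdict
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request


def make_cookie(name, value, domain):
    """Build a simple session cookie (hypothetical helper for this sketch)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Step 1: one jar per key, created lazily on first access.
jars = defaultdict(CookieJar)


def attach_cookies(jar_key, url, cookies):
    """Mimic the request-side flow: pick a jar, merge cookies, set the header."""
    jar = jars[jar_key]                        # step 2: look up the jar
    request = Request(url)
    for cookie in cookies:
        jar.set_cookie_if_ok(cookie, request)  # step 3: the policy decides
    request.remove_header("Cookie")            # drop any pre-existing header
    jar.add_cookie_header(request)             # step 4: write matching cookies
    return request


req_a = attach_cookies(
    "user_a", "http://example.com/",
    [make_cookie("session", "abc", "example.com")],
)
req_b = attach_cookies("user_b", "http://example.com/", [])

print(req_a.get_header("Cookie"))  # the cookie stored in jar "user_a"
print(req_b.get_header("Cookie"))  # jar "user_b" is separate and still empty
```

Because each key gets its own jar, the cookie stored under "user_a" never leaks into requests that use "user_b". This is what makes it possible to keep several independent sessions in one spider by giving requests different "cookiejar" values in meta.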

Next, let's see how cookies in the response are handled:

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response

        # extract cookies from Set-Cookie and drop invalid/expired cookies
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        return response

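The flow is as follows:

  • First, the cookie jar that belongs to the request is looked up in the dictionary of cookie jars.
  • Then extract_cookies adds the cookies from the response's Set-Cookie headers to that cookie jar.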
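The response side can be sketched the same way with the standard library's http.cookiejar.CookieJar, which is what Scrapy's CookieJar wraps. FakeResponse is a hypothetical stand-in written for this example: the stdlib jar only needs an object whose info() method returns the response headers. The URLs and cookie values are made up as well.

```python
from email.message import Message
from http.cookiejar import CookieJar
from urllib.request import Request


class FakeResponse:
    """Minimal response stand-in exposing info(), which extract_cookies reads."""

    def __init__(self, set_cookie_values):
        self._headers = Message()
        for value in set_cookie_values:
            # Assigning to a Message appends, so repeated headers are kept.
            self._headers["Set-Cookie"] = value

    def info(self):
        return self._headers


jar = CookieJar()

# The server answers the login request with a Set-Cookie header.
login_request = Request("http://example.com/login")
login_response = FakeResponse(["session=abc; Path=/"])

# Parse Set-Cookie and store whatever the jar's policy accepts.
jar.extract_cookies(login_response, login_request)

# A later request to the same site gets the stored cookie back.
follow_up = Request("http://example.com/profile")
jar.add_cookie_header(follow_up)
print(follow_up.get_header("Cookie"))
```

Together with the request-side handling, this closes the loop: cookies received in one response are stored in the jar and sent back automatically on subsequent requests that use the same jar.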