How to Filter Duplicate Pages

Scrapy supports deduplicating pages (i.e., avoiding re-crawling the same page) through RFPDupeFilter.

RFPDupeFilter actually filters based on request_fingerprint, which is implemented as follows:

import hashlib
import weakref

from scrapy.utils.url import canonicalize_url

_fingerprint_cache = weakref.WeakKeyDictionary()

def request_fingerprint(request, include_headers=None):
    if include_headers:
        include_headers = tuple([h.lower() for h in sorted(include_headers)])
    # fingerprints are cached per request object
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(request.method)
        fp.update(canonicalize_url(request.url))
        fp.update(request.body or '')
        # headers are only hashed when include_headers is passed explicitly
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]

As we can see, the dedup fingerprint is sha1(method + canonicalized URL + body + headers), where the headers are only included if include_headers is passed (the default dupe filter does not pass it).

So in practice the proportion of duplicates it can actually filter out is not that large: any difference in the URL (an extra query parameter, for instance) produces a different fingerprint.
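
To see this concretely, here is a minimal sketch (the URLs are made up for illustration) that calls Scrapy's request_fingerprint on a few requests. Reordered query parameters are normalized away by canonicalize_url, but an extra parameter still yields a different fingerprint:

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

r1 = Request('http://example.com/page?id=1&sort=asc')
r2 = Request('http://example.com/page?sort=asc&id=1')
# canonicalize_url sorts the query string, so these two count as the same page
print(request_fingerprint(r1) == request_fingerprint(r2))   # True

r3 = Request('http://example.com/page?id=1&sort=asc&utm_source=feed')
# one extra tracking parameter changes the URL, hence a different fingerprint
print(request_fingerprint(r1) == request_fingerprint(r3))   # False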

If we want to compute the dedup fingerprint ourselves, we need to implement our own filter and configure Scrapy to use it.

The following filter deduplicates based on the URL alone:

from scrapy.dupefilter import RFPDupeFilter

class SeenURLFilter(RFPDupeFilter):
    """A dupe filter that considers only the URL"""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        # drop the request if this exact URL has been scheduled before
        if request.url in self.urls_seen:
            return True
        else:
            self.urls_seen.add(request.url)
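
As a quick sanity check (assuming the class above is importable), the filter reports the second request for the same URL as already seen:

from scrapy.http import Request

f = SeenURLFilter()
print(f.request_seen(Request('http://example.com/a')))   # None (falsy): first visit
print(f.request_seen(Request('http://example.com/a')))   # True: duplicate, will be dropped
print(f.request_seen(Request('http://example.com/b')))   # None (falsy): a new URL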

Don't forget to enable it in the project settings:

DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter'
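
The dotted path has to point to wherever the class actually lives. The value above assumes a project package named scraper with the filter saved in scraper/custom_filters.py, roughly like this:

scraper/
    scrapy.cfg
    scraper/
        settings.py          # DUPEFILTER_CLASS is set here
        custom_filters.py    # SeenURLFilter is defined here
        spiders/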