How to "Follow" and "Filter" Links

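In many cases we are not scraping a single page. Instead we need to follow the trail: start from a few seed pages, follow the hyperlinks they contain, and eventually reach the pages we actually want.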

Scrapy provides a clean abstraction for this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class Coder4Spider(CrawlSpider):
    name = 'coder4'
    allowed_domains = ['xxx.com']
    start_urls = ['http://www.xxx.com']
    rules = (
        # Stepping stone: follow pagination links, no callback.
        Rule(LinkExtractor(allow=(r'page/[0-9]+',))),
        # Target pages: hand each matching article page to parse_item.
        Rule(LinkExtractor(allow=(r'archives/[0-9]+',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # Log the URL of every target page that was reached.
        self.log('%s' % response.url)

Note that the spider above subclasses CrawlSpider rather than Spider. The name, allowed_domains, and start_urls attributes need no further explanation.

The important part is Rule:

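The first rule has no callback, so it acts only as a stepping stone: Scrapy downloads the page, extracts the links that match the allow patterns, and keeps crawling from them. The link extractor also accepts deny=xxx to filter out links you do not want to visit.
The second rule has a callback, so pages matching it are the ones that are finally passed to parse_item. By default a rule with a callback does not follow links from those pages any further unless follow=True is set.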
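
To make the interaction between allow and deny concrete, here is a minimal sketch that models the filtering with nothing but Python's re module. It is a simplified illustration and not Scrapy's actual implementation: the helper link_passes, the sample URLs, and the '/feed$' deny pattern are all hypothetical names chosen for this example. The assumption is that a link is kept when it matches at least one allow pattern (or no allow pattern is given) and matches no deny pattern.

```python
import re


def link_passes(url, allow=(), deny=()):
    """Simplified model of allow/deny filtering (hypothetical helper, not Scrapy code)."""
    # An empty allow tuple accepts every URL; otherwise one pattern must match.
    allowed = not allow or any(re.search(p, url) for p in allow)
    # Any matching deny pattern rejects the URL, even if it was allowed.
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied


# Sample URLs, made up for illustration.
urls = [
    'http://www.xxx.com/page/2',
    'http://www.xxx.com/archives/123',
    'http://www.xxx.com/archives/123/feed',
    'http://www.xxx.com/about',
]

# Keep article pages but drop their feed variants.
kept = [
    u for u in urls
    if link_passes(u, allow=(r'archives/[0-9]+',), deny=(r'/feed$',))
]
print(kept)
```

In a real spider the same two tuples would be passed to the link extractor inside a Rule, so the second rule above could exclude feed URLs by adding a deny argument next to allow.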