Logging

Scrapy provides logging, which you use through Python's standard logging module.

Log levels

Scrapy supports five logging levels:

  1. CRITICAL - critical errors
  2. ERROR - regular errors
  3. WARNING - warning messages
  4. INFO - informational messages
  5. DEBUG - debugging messages

By default, Python's logging module prints to standard output and only shows messages at WARNING level or above; in other words, the default log level is WARNING (severity order: CRITICAL > ERROR > WARNING > INFO > DEBUG). Scrapy's own LOG_LEVEL setting, by contrast, defaults to DEBUG, so a Scrapy crawl shows every message.
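You can see Python's default threshold with a few lines of standard-library code (a minimal sketch):

import logging

# With no prior configuration, the root logger's level is WARNING,
# so DEBUG and INFO messages are dropped silently.
logging.debug("not shown")
logging.info("not shown")
logging.warning("shown - WARNING sits exactly at the default threshold")

# Lowering the root logger's level makes the quieter messages appear.
logging.getLogger().setLevel(logging.DEBUG)
logging.debug("now shown")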

How to set the log level

You can set the log level with the command line option --loglevel/-L or with the LOG_LEVEL setting:

  • scrapy crawl tencent_crawl -L INFO

  • Or edit the project's settings.py and add:

LOG_LEVEL='INFO'
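A single spider can also override the project-wide level through its custom_settings attribute (a minimal sketch; the spider name and URL are just placeholders):

import scrapy

class QuietSpider(scrapy.Spider):
    name = 'quiet'
    start_urls = ['http://hr.tencent.com/position.php']
    # per-spider override of LOG_LEVEL
    custom_settings = {'LOG_LEVEL': 'INFO'}

    def parse(self, response):
        self.logger.info('fetched %s', response.url)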

Adding log messages in a Spider

Scrapy provides a logger on every Spider instance, which can be accessed like this:

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['http://scrapinghub.com']

    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)

That logger is created with the Spider's name, but you can use any custom logger you want. For example:

import logging
import scrapy

logger = logging.getLogger('zhangsan')

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['http://scrapinghub.com']

    def parse(self, response):
        logger.info('Parse function called on %s', response.url)
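Because Scrapy installs its log handler on the root logger, messages from a custom logger such as 'zhangsan' still propagate to that handler and honor the logging settings described next.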

Logging settings

The following settings can be used to configure logging:

LOG_ENABLED

Default: True. Enables logging.

LOG_ENCODING

Default: 'utf-8'. The encoding to use for logging.

LOG_FILE

Default: None. The file name to use for logging output.

LOG_LEVEL

Default: 'DEBUG'. The minimum level to log.

LOG_STDOUT

Default: False.

If set to True, all standard output (and error) of the process will be redirected to the log. For example, print 'hello' will show up in the Scrapy log.
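Putting these together, a settings.py fragment for the case studies below might look like this (a sketch; the file name is just an example):

LOG_ENABLED = True       # logging on (the default)
LOG_ENCODING = 'utf-8'   # encoding of the log output (the default)
LOG_FILE = 'ten.log'     # write the log to a file instead of stderr
LOG_LEVEL = 'INFO'       # suppress DEBUG messages
LOG_STDOUT = True        # capture print output into the log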

Case 1

Add log messages to tencent_crawl.py as follows:

    '''
    Add log messages
    '''
    print 'print',response.url

    self.logger.info('info on %s', response.url)
    self.logger.warning('WARNING on %s', response.url)
    self.logger.debug('info on %s', response.url)
    self.logger.error('info on %s', response.url)

The full version:

# -*- coding:utf-8 -*-
import scrapy
from tutorial.items import RecruitItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RecruitSpider(CrawlSpider):
    name = "tencent_crawl"
    allowed_domains = ["hr.tencent.com"]
    start_urls = [
        "http://hr.tencent.com/position.php?&start=0#a"
    ]


    # extract links matching 'http://hr.tencent.com/position.php?&start=\d+'
    page_lx = LinkExtractor(allow=('start=\d+'))

    rules = [
        # parse each matched page with parseContent and keep following links
        # (with a callback given, follow would default to False, so set it explicitly)
        Rule(page_lx, callback='parseContent', follow=True)
    ]

    def parseContent(self, response):

        #print("print settings: %s" % self.settings['LOG_FILE'])
        '''
        Add log messages
        '''
        print 'print',response.url

        self.logger.info('info on %s', response.url)
        self.logger.warning('WARNING on %s', response.url)
        self.logger.debug('info on %s', response.url)
        self.logger.error('info on %s', response.url)


        for sel in response.xpath('//*[@class="even"]'):
            name = sel.xpath('./td[1]/a/text()').extract()[0]
            detailLink = sel.xpath('./td[1]/a/@href').extract()[0]
            catalog = None
            if sel.xpath('./td[2]/text()'):
                catalog = sel.xpath('./td[2]/text()').extract()[0]

            recruitNumber = sel.xpath('./td[3]/text()').extract()[0]
            workLocation = sel.xpath('./td[4]/text()').extract()[0]
            publishTime = sel.xpath('./td[5]/text()').extract()[0]
            #print name, detailLink, catalog, recruitNumber, workLocation, publishTime
            item = RecruitItem()
            item['name'] = name
            item['detailLink'] = detailLink
            if catalog:
                item['catalog'] = catalog
            item['recruitNumber'] = recruitNumber
            item['workLocation'] = workLocation
            item['publishTime'] = publishTime
            yield item
  • In the settings file, add:

      LOG_FILE='ten.log'
      LOG_LEVEL='INFO'
    

    Then run: scrapy crawl tencent_crawl

  • Or pass the options on the command line:

    scrapy crawl tencent_crawl --logfile 'ten.log' -L INFO
    

The console output:

print http://hr.tencent.com/position.php?start=10
print http://hr.tencent.com/position.php?start=1340
print http://hr.tencent.com/position.php?start=0
print http://hr.tencent.com/position.php?start=1320
print http://hr.tencent.com/position.php?start=1310
print http://hr.tencent.com/position.php?start=1300
print http://hr.tencent.com/position.php?start=1290
print http://hr.tencent.com/position.php?start=1260

In the ten.log file you can see that only messages at INFO level and above were recorded:

2016-08-15 23:10:57 [tencent_crawl] INFO: info on http://hr.tencent.com/position.php?start=70
2016-08-15 23:10:57 [tencent_crawl] WARNING: WARNING on http://hr.tencent.com/position.php?start=70
2016-08-15 23:10:57 [tencent_crawl] ERROR: info on http://hr.tencent.com/position.php?start=70
2016-08-15 23:10:57 [tencent_crawl] INFO: info on http://hr.tencent.com/position.php?start=1320
2016-08-15 23:10:57 [tencent_crawl] WARNING: WARNING on http://hr.tencent.com/position.php?start=1320
2016-08-15 23:10:57 [tencent_crawl] ERROR: info on http://hr.tencent.com/position.php?start=1320

Case 2

Add log messages to tencent_spider.py as follows:

logger = logging.getLogger('zhangsan')
    '''
    Add log messages
    '''
    print 'print',response.url

    logger.info('info on %s', response.url)
    logger.warning('WARNING on %s', response.url)
    logger.debug('info on %s', response.url)
    logger.error('info on %s', response.url)

The full version:

import scrapy
from tutorial.items import RecruitItem
import re
import logging

logger = logging.getLogger('zhangsan')

class RecruitSpider(scrapy.spiders.Spider):
    name = "tencent"
    allowed_domains = ["hr.tencent.com"]
    start_urls = [
        "http://hr.tencent.com/position.php?&start=0#a"
    ]

    def parse(self, response):
        #logger.info('spider tencent Parse function called on %s', response.url)
        '''
        Add log messages
        '''
        print 'print',response.url

        logger.info('info on %s', response.url)
        logger.warning('WARNING on %s', response.url)
        logger.debug('info on %s', response.url)
        logger.error('info on %s', response.url)

        for sel in response.xpath('//*[@class="even"]'):
            name = sel.xpath('./td[1]/a/text()').extract()[0]
            detailLink = sel.xpath('./td[1]/a/@href').extract()[0]
            catalog = None
            if sel.xpath('./td[2]/text()'):
                catalog = sel.xpath('./td[2]/text()').extract()[0]

            recruitNumber = sel.xpath('./td[3]/text()').extract()[0]
            workLocation = sel.xpath('./td[4]/text()').extract()[0]
            publishTime = sel.xpath('./td[5]/text()').extract()[0]
            #print name, detailLink, catalog, recruitNumber, workLocation, publishTime
            item = RecruitItem()
            item['name'] = name
            item['detailLink'] = detailLink
            if catalog:
                item['catalog'] = catalog
            item['recruitNumber'] = recruitNumber
            item['workLocation'] = workLocation
            item['publishTime'] = publishTime
            yield item

        # follow the pagination: bump the start= offset by 10 and request the next page
        nextFlag = response.xpath('//*[@id="next"]/@href')[0].extract()
        if 'start' in nextFlag:
            curpage = re.search('(\d+)', response.url).group(1)
            page = int(curpage) + 10
            url = re.sub('\d+', str(page), response.url)
            print url
            yield scrapy.Request(url, callback=self.parse)
  • In the settings file, add:

      LOG_FILE='tencent.log'
      LOG_LEVEL='WARNING'
    

    Then run: scrapy crawl tencent

  • Or pass the options on the command line:

    scrapy crawl tencent --logfile 'tencent.log' -L WARNING
    

The console output:

print http://hr.tencent.com/position.php?&start=0
http://hr.tencent.com/position.php?&start=10
print http://hr.tencent.com/position.php?&start=10
http://hr.tencent.com/position.php?&start=20
print http://hr.tencent.com/position.php?&start=20
http://hr.tencent.com/position.php?&start=30

In the tencent.log file you can see that only messages at WARNING level and above were recorded:

2016-08-15 23:22:59 [zhangsan] WARNING: WARNING on http://hr.tencent.com/position.php?&start=0
2016-08-15 23:22:59 [zhangsan] ERROR: info on http://hr.tencent.com/position.php?&start=0
2016-08-15 23:22:59 [zhangsan] WARNING: WARNING on http://hr.tencent.com/position.php?&start=10
2016-08-15 23:22:59 [zhangsan] ERROR: info on http://hr.tencent.com/position.php?&start=10

Trying out LOG_STDOUT

settings.py:

LOG_FILE='tencent.log'
LOG_STDOUT=True
LOG_LEVEL='INFO'

Then run: scrapy crawl tencent

Output in the tencent.log file. With LOG_STDOUT enabled, every write to standard output is logged as its own INFO record under the [stdout] logger, which is why the literal string print and the URL from print 'print',response.url end up on separate lines:

2016-08-15 23:28:32 [stdout] INFO: http://hr.tencent.com/position.php?&start=110
2016-08-15 23:28:32 [stdout] INFO: print
2016-08-15 23:28:32 [stdout] INFO: http://hr.tencent.com/position.php?&start=110
2016-08-15 23:28:32 [zhangsan] INFO: info on http://hr.tencent.com/position.php?&start=110
2016-08-15 23:28:32 [zhangsan] WARNING: WARNING on http://hr.tencent.com/position.php?&start=110
2016-08-15 23:28:32 [zhangsan] ERROR: info on http://hr.tencent.com/position.php?&start=110
2016-08-15 23:28:32 [stdout] INFO: http://hr.tencent.com/position.php?&start=120
2016-08-15 23:28:33 [stdout] INFO: print
2016-08-15 23:28:33 [stdout] INFO: http://hr.tencent.com/position.php?&start=120
2016-08-15 23:28:33 [zhangsan] INFO: info on http://hr.tencent.com/position.php?&start=120
2016-08-15 23:28:33 [zhangsan] WARNING: WARNING on http://hr.tencent.com/position.php?&start=120
2016-08-15 23:28:33 [zhangsan] ERROR: info on http://hr.tencent.com/position.php?&start=120

Using Logging in Scrapy

#coding:utf-8
######################
## Using the logging module
######################
import logging
'''
1. logging.CRITICAL - for critical errors (highest severity)
2. logging.ERROR - for regular errors
3. logging.WARNING - for warning messages
4. logging.INFO - for informational messages
5. logging.DEBUG - for debugging messages (lowest severity)
'''
logging.warning("This is a warning")

logging.log(logging.WARNING,"This is a warning")

# get the root logger
logger = logging.getLogger()
logger.warning("This is a warning message")
# name the logger that emits the message
logger = logging.getLogger('SimilarFace')
logger.warning("This is a warning")

# using loggers inside a spider
import scrapy
class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://scrapinghub.com']
    def parse(self, response):
        # option 1: the spider's built-in logger
        self.logger.info('Parse function called on %s', response.url)
        # option 2: the module-level logger defined above
        logger.info('Parse function called on %s', response.url)

'''
Logging settings
• LOG_FILE
• LOG_ENABLED
• LOG_ENCODING
• LOG_LEVEL
• LOG_FORMAT
• LOG_DATEFORMAT
• LOG_STDOUT

Command line options
--logfile FILE
    Overrides LOG_FILE

--loglevel/-L LEVEL
    Overrides LOG_LEVEL

--nolog
    Sets LOG_ENABLED to False
'''

import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
# configure the root logger ourselves instead of letting Scrapy install its handler
logging.basicConfig(
    filename='log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)
# basicConfig opens the file in append mode, so repeated runs append to it
logging.info('This goes into the log file')
logger = logging.getLogger('SimilarFace')
logger.warning("This also goes into the log file")
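LOG_FORMAT and LOG_DATEFORMAT appear in the settings list above but are not demonstrated; both take standard logging format strings. The values below are Scrapy's defaults, shown as a settings.py sketch:

LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'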