爬取糗事百科段子

确定URL并抓取页面代码,首先我们确定好页面的URL是 http://www.qiushibaike.com/8hr/page/4, 其中最后一个数字1代表页数,我们可以传入不同的值来获得某一页的段子内容。

我们初步构建如下的代码来打印页面代码内容试试看,先构造最基本的页面抓取方式,看看会不会成功

在Composer raw 模拟发送数据

GET http://www.qiushibaike.com/8hr/page/2/ HTTP/1.1
Host: www.qiushibaike.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
Accept-Language: zh-CN,zh;q=0.8

在删除了User-Agent、Accept-Language报错

应该是headers验证的问题,加上一个headers验证试试看

# -*- coding:utf-8 -*-
import urllib
import requests
import re
import chardet
from lxml import etree

page = 2
url = 'http://www.qiushibaike.com/8hr/page/' + str(page) + "/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.8'}
try:
    response = requests.get(url, headers=headers)
    resHtml = response.text

    html = etree.HTML(resHtml)
    result = html.xpath('//div[contains(@id,"qiushi_tag")]')
    for site in result:
        #print etree.tostring(site,encoding='utf-8')

        item = {}
        imgUrl = site.xpath('./div/a/img/@src')[0].encode('utf-8')
        username = site.xpath('./div/a/@title')[0].encode('utf-8')
        #username = site.xpath('.//h2')[0].text
        content = site.xpath('.//div[@class="content"]')[0].text.strip().encode('utf-8')
        vote = site.xpath('.//i')[0].text
        #print site.xpath('.//*[@class="number"]')[0].text
        comments = site.xpath('.//i')[1].text

        print imgUrl, username, content, vote, comments

except Exception, e:
    print e

好啦,大家来测试一下吧,点一下回车会输出一个段子,包括发布人,发布时间,段子内容以及点赞数,是不是感觉爽爆了!