Python-Goose - Article Extractor
What is Goose Extractor?
Goose Extractor is an open-source article extraction library for Python. You can use it to extract an article's body text, images, videos, meta information, and tags. Goose was originally a Java library written by Gravity.com and was later ported to Scala.
The Goose Extractor site describes it as follows:
'Goose Extractor has been completely rewritten in Python. The goal is to take any news article or article-type web page and extract not only the article body, but also all meta information, images, and so on.'
Official site:
https://github.com/grangier/python-goose
Setup
mkvirtualenv --no-site-packages goose
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install
Configuration
There are two ways to pass configuration to Goose. The first is to pass a Configuration() object; the second is to pass a configuration dict.
For instance, if you want to change the user agent used by Goose, just pass:
g = Goose({'browser_user_agent': 'Mozilla'})
Switching parsers: Goose can be used with either the lxml HTML parser or the lxml soup parser. The HTML parser is used by default. If you want the soup parser instead, pass it in the configuration dict:
g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})
Goose is now language aware
For example, scraping a content page whose meta language tags are set correctly:
from goose import Goose
url = 'https://www.yahoo.com/news/keegan-michael-key-brings-back-201647910.html?nhp=1'
g = Goose()
article = g.extract(url=url)
article.title
u"Keegan-Michael Key Brings Back Obama's Hilarious Anger Translator for RNC"
article.cleaned_text
u"Keegan-Michael Key does an impersonation of his role as President Obama's 'anger translator' from the 2015 White House Correspondents Dinner during Popcorn With Peter Travers."
article.infos
{'authors': [],
'cleaned_text': u"Keegan-Michael Key does an impersonation of his role as President Obama's 'anger translator' from the 2015 White House Correspondents Dinner during Popcorn With Peter Travers.",
'domain': 'www.yahoo.com',
'image': {'height': 0, 'type': 'image', 'url': '1280', 'width': 0},
'links': [],
'meta': {'canonical': 'https://www.yahoo.com/news/keegan-michael-key-brings-back-201647910.html',
'description': "Keegan-Michael Key does an impersonation of his role as President Obama's 'anger translator' from the 2015 White House Correspondents Dinner during Popcorn With Peter Travers.",
'favicon': 'https://s.yimg.com/os/mit/media/p/common/images/favicon_new-7483e38.svg',
'keywords': '',
'lang': 'en'},
'movies': [],
'opengraph': {'description': "Keegan-Michael Key does an impersonation of his role as President Obama's 'anger translator' from the 2015 White House Correspondents Dinner during Popcorn With Peter Travers.",
'image': 'https://s.yimg.com/uu/api/res/1.2/lSnhGN5TgE81cDklUN91jg--/aD03MjA7dz0xMjgwO3NtPTE7YXBwaWQ9eXRhY2h5b24-/http://media.zenfs.com/en-US/video/video.abcnewsplus.com/d5fbdafba5110e27aaf0fb4084967e0c',
'title': "Keegan-Michael Key Brings Back Obama's Hilarious Anger Translator for RNC",
'type': 'article',
'url': 'https://www.yahoo.com/news/keegan-michael-key-brings-back-201647910.html'},
'publish_date': None,
'tags': [],
'title': u"Keegan-Michael Key Brings Back Obama's Hilarious Anger Translator for RNC",
'tweets': []}
Goose in Chinese
Some users want to use Goose for Chinese content. Chinese word segmentation is considerably harder to deal with than for Western languages, so Chinese content needs a dedicated stop-words analyzer, which must be passed in via the config object:
from goose import Goose
from goose.text import StopWordsChinese
url = 'http://world.huanqiu.com/exclusive/2016-07/9209839.html'
g = Goose({'stopwords_class': StopWordsChinese})
article = g.extract(url=url)
print article.cleaned_text[:150]
【环球时报综合报道】针对美国共和党全国代表大会审议通过的新党纲在台湾、涉藏、经贸、南海等问题上出现干涉中国内政、指责中国政策的内容,中国外交部发言人陆慷20日回应说,推动中美关系稳定发展符合两国根本利益,有利于亚太地区乃至世界的和平与发展,是双方应该坚持的正确方向。美国无论哪个党派,都应该客观、理性
print article.meta_description
美国共和党新党纲“21次提及中国”。
print article.meta_keywords
中国,美共和党,党纲,内政
Known issues
There are some issues with unicode URLs.
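One common sketch of a workaround (not part of Goose itself; the helper name here is hypothetical) is to percent-encode the non-ASCII parts of the URL before handing it to Goose:

```python
# Hypothetical helper: percent-encode the path and query of a URL that may
# contain non-ASCII characters, leaving the scheme and host untouched.
try:
    # Python 3
    from urllib.parse import quote, urlsplit, urlunsplit
except ImportError:
    # Python 2
    from urllib import quote
    from urlparse import urlsplit, urlunsplit


def ascii_safe_url(url):
    parts = urlsplit(url)
    path = quote(parts.path.encode('utf-8'), safe='/%')
    query = quote(parts.query.encode('utf-8'), safe='=&%')
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(ascii_safe_url(u'http://example.com/\u4e2d\u6587/page'))
# The encoded URL can then be passed to g.extract(url=...) as usual.
```

This keeps the request on the safe ASCII subset of URLs that the underlying HTTP stack handles reliably.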
Cookie handling: Some websites require cookie handling. At the moment the only workaround is to fetch the page yourself and use raw_html extraction. For instance:
import urllib2
import goose
url = "http://oversea.huanqiu.com/article/2016-07/9198141.html"
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open(url)
raw_html = response.read()
g = goose.Goose()
a = g.extract(raw_html=raw_html)
a.infos
{'authors': [],
'cleaned_text': u'',
'domain': None,
'image': None,
'links': [],
'meta': {'canonical': None,
'description': u'\u4f60\u771f\u4e22\u4eba\uff01\u4e2d\u56fd\u7528\u5927\u5c4f\u5e55\u66dd\u5149\u201c\u8001\u8d56\u201d',
'favicon': 'http://himg2.huanqiu.com/statics/images/favicon1.ico',
'keywords': u'\u73af\u7403\u7f51',
'lang': None},
'movies': [],
'opengraph': {},
'publish_date': None,
'tags': [],
'title': u'\u6cf0\u5a92\uff1a\u4f60\u771f\u4e22\u4eba\uff01\u4e0a\u6d77\u7528\u5927\u5c4f\u5e55\u66dd\u5149\u201c\u8001\u8d56\u201d_\u6d77\u5916\u770b\u4e2d\u56fd_\u73af\u7403\u7f51',
'tweets': []}
a.meta_keywords
u'\u8521\u82f1\u6587,\u4e5d\u4e8c\u5171\u8bc6'
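The example above targets Python 2 (urllib2). On Python 3 (for example with the goose3 fork), the equivalent cookie-aware fetch would use urllib.request and http.cookiejar; the following is a sketch of the same raw_html workflow, with the network call and the Goose extraction left commented out:

```python
import urllib.request
import http.cookiejar

# Build an opener that stores and resends cookies across requests.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

url = 'http://oversea.huanqiu.com/article/2016-07/9198141.html'
# raw_html = opener.open(url).read()        # fetch with cookie handling
# article = Goose().extract(raw_html=raw_html)
```

The opener keeps any Set-Cookie values it receives in the jar and replays them on subsequent requests, which is what cookie-dependent sites expect.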