This is the GitHub link for newspaper:
This is the link to the newspaper documentation:
https://newspaper.readthedocs.io/en/latest/
This is the link to the newspaper quickstart guide:
https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html
pip3 install newspaper3k
Its main features are as follows:
Introduction:
import newspaper

web_paper = newspaper.build("http://www.sxdi.gov.cn/gzdt/jlsc/", language="zh", memoize_articles=False)
Note on article caching: by default, newspaper caches all previously extracted articles and skips any article it has already extracted. This feature prevents duplicate articles and speeds up extraction. You can opt out of it with the memoize_articles parameter.
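The caching behaviour described above can be sketched with a plain seen-URL set. This is a conceptual illustration only, not newspaper's actual implementation; the function name is my own:

```python
# Conceptual sketch of what memoize_articles achieves: skip URLs that were
# already extracted on a previous run. NOT newspaper's internal code.

def filter_new_urls(current_urls, seen):
    """Return only URLs not extracted before, and record them as seen."""
    fresh = [u for u in current_urls if u not in seen]
    seen.update(fresh)
    return fresh

seen = set()
first_run = filter_new_urls(
    ["http://example.com/a.html", "http://example.com/b.html"], seen)
second_run = filter_new_urls(
    ["http://example.com/a.html", "http://example.com/c.html"], seen)
print(first_run)   # both URLs are new on the first run
print(second_run)  # only the URL not seen before remains
```

Passing memoize_articles=False is the equivalent of starting with an empty seen set on every build.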
for article in web_paper.articles:
    print(article.url)

output:
http://www.sxdi.gov.cn/gzdt/jlsc/2019101220009.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019101119998.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100919989.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100819980.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919940.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919933.html
....
for category in web_paper.category_urls():
    print(category)

output:
http://www.sxdi.gov.cn/gzdt/jlsc/....
for feed_url in web_paper.feed_urls():
    print(feed_url)
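In the output above, article pages and category listings can be told apart by their paths. A rough heuristic for this (my own assumption for this particular site, not anything newspaper does) is:

```python
# Heuristic for this site's URL scheme (an assumption, not newspaper's logic):
# article pages end in an .html file, category listings end in a directory path.
from urllib.parse import urlparse

def looks_like_article(url):
    path = urlparse(url).path
    return path.endswith(".html")

print(looks_like_article("http://www.sxdi.gov.cn/gzdt/jlsc/2019101220009.html"))  # True
print(looks_like_article("http://www.sxdi.gov.cn/gzdt/jlsc/"))                    # False
```

Such a filter can be handy when you only want to hand article pages to the downloader and skip listing pages.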
5. Extract the source brand and description
print(web_paper.brand)        # brand
print(web_paper.description)  # description
print("Fetched %s articles in total" % web_paper.size())  # number of articles
from newspaper import Article

article = Article("http://www.sol.com.cn/", language='zh')  # Chinese
article.download()
article.parse()  # parse the page
print("title=", article.title)                # article title
print("author=", article.authors)             # article authors
print("publish_date=", article.publish_date)  # publication date
print("top_image=", article.top_image)        # URL of the article's top image
print("movies=", article.movies)              # video links in the article
print("text=", article.text, "\n")            # article body text
article.nlp()  # must be called after parse(); populates keywords and summary
print('keywords=', article.keywords)  # keywords extracted from the text
print("summary=", article.summary)    # article summary
print("images=", article.images)      # all image URLs extracted from the HTML
print("imgs=", article.imgs)
print("html=", article.html)          # raw HTML
import newspaper
from newspaper import Article


def spider_newspaper_url(url):
    """
    By default, newspaper caches all previously extracted articles and skips
    any article it has already extracted; use the memoize_articles parameter
    to opt out of this behaviour.
    """
    web_paper = newspaper.build(url, language="zh", memoize_articles=False)
    print("Extracting article URLs from the news site!!!")
    for article in web_paper.articles:
        # URL of the news page
        print("article URL:", article.url)
        # fetch the article data with spider_newspaper_information
        spider_newspaper_information(article.url)
    print("Fetched %s articles in total" % web_paper.size())  # number of articles


# fetch the details of one article
def spider_newspaper_information(url):
    # build the connection and download the article
    article = Article(url, language='zh')  # Chinese
    article.download()
    article.parse()
    article.nlp()  # required before article.summary is populated
    # print the article's details
    print("title=", article.title)                # article title
    print("author=", article.authors)             # article authors
    print("publish_date=", article.publish_date)  # publication date
    # print("top_image=", article.top_image)      # URL of the article's top image
    # print("movies=", article.movies)            # video links in the article
    print("text=", article.text, "\n")            # article body text
    print("summary=", article.summary)            # article summary


if __name__ == "__main__":
    web_lists = ["http://www.sxdi.gov.cn/gzdt/jlsc/", "http://www.people.com.cn/gb/59476/"]
    for web_list in web_lists:
        spider_newspaper_url(web_list)