This is the GitHub link for newspaper:
This is the link to the newspaper documentation:
https://newspaper.readthedocs.io/en/latest/
This is the link to the newspaper quickstart guide:
https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html
pip3 install newspaper3k
Its main features are as follows:
Introduction:
import newspaper

web_paper = newspaper.build("http://www.sxdi.gov.cn/gzdt/jlsc/", language="zh", memoize_articles=False)
Note on article caching: by default, newspaper caches all previously extracted articles and skips any article it has already extracted. This feature prevents duplicate articles and speeds up extraction. You can opt out of it with the memoize_articles parameter.
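The caching behaviour described above can be sketched with a plain seen-URL set. This is a conceptual illustration only, not newspaper's actual implementation; the function name is my own:

```python
# Conceptual sketch of what memoize_articles achieves: skip URLs that were
# already extracted on a previous run. NOT newspaper's internal code.

def filter_new_urls(current_urls, seen):
    """Return only URLs not extracted before, and record them as seen."""
    fresh = [u for u in current_urls if u not in seen]
    seen.update(fresh)
    return fresh

seen = set()
first_run = filter_new_urls(
    ["http://example.com/a.html", "http://example.com/b.html"], seen)
second_run = filter_new_urls(
    ["http://example.com/a.html", "http://example.com/c.html"], seen)
print(first_run)   # both URLs are new on the first run
print(second_run)  # only the URL not seen before remains
```

Passing memoize_articles=False is the equivalent of starting with an empty seen set on every build.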
for article in web_paper.articles:
    print(article.url)

output:
http://www.sxdi.gov.cn/gzdt/jlsc/2019101220009.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019101119998.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100919989.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019100819980.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919940.html
http://www.sxdi.gov.cn/gzdt/jlsc/2019092919933.html
....
for category in web_paper.category_urls():
    print(category)

output:
http://www.sxdi.gov.cn/gzdt/jlsc/....
for feed_url in web_paper.feed_urls():
    print(feed_url)
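In the output above, article pages and category listings can be told apart by their paths. A rough heuristic for this (my own assumption for this particular site, not anything newspaper does) is:

```python
# Heuristic for this site's URL scheme (an assumption, not newspaper's logic):
# article pages end in an .html file, category listings end in a directory path.
from urllib.parse import urlparse

def looks_like_article(url):
    path = urlparse(url).path
    return path.endswith(".html")

print(looks_like_article("http://www.sxdi.gov.cn/gzdt/jlsc/2019101220009.html"))  # True
print(looks_like_article("http://www.sxdi.gov.cn/gzdt/jlsc/"))                    # False
```

Such a filter can be handy when you only want to hand article pages to the downloader and skip listing pages.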
5. Extract the source brand and description
print(web_paper.brand)        # brand
print(web_paper.description)  # description
print("Fetched %s articles in total" % web_paper.size())  # number of articles
from newspaper import Article

article = Article("http://www.sol.com.cn/", language='zh')  # Chinese
article.download()
article.parse()  # parse the page
print("title=", article.title)                # article title
print("author=", article.authors)             # article authors
print("publish_date=", article.publish_date)  # publication date
print("top_image=", article.top_image)        # URL of the article's top image
print("movies=", article.movies)              # video links in the article
print("text=", article.text, "\n")            # article body text
article.nlp()  # must be called after parse(); populates keywords and summary
print('keywords=', article.keywords)  # keywords extracted from the text
print("summary=", article.summary)    # article summary
print("images=", article.images)      # all image URLs extracted from the HTML
print("imgs=", article.imgs)
print("html=", article.html)          # raw HTML
import newspaper
from newspaper import Article


def spider_newspaper_url(url):
    """
    By default, newspaper caches all previously extracted articles and skips
    any article it has already extracted; use the memoize_articles parameter
    to opt out of this behaviour.
    """
    web_paper = newspaper.build(url, language="zh", memoize_articles=False)
    print("Extracting article URLs from the news site!!!")
    for article in web_paper.articles:
        # URL of the news page
        print("article URL:", article.url)
        # fetch the article data with spider_newspaper_information
        spider_newspaper_information(article.url)
    print("Fetched %s articles in total" % web_paper.size())  # number of articles


# fetch the details of one article
def spider_newspaper_information(url):
    # build the connection and download the article
    article = Article(url, language='zh')  # Chinese
    article.download()
    article.parse()
    article.nlp()  # required before article.summary is populated
    # print the article's details
    print("title=", article.title)                # article title
    print("author=", article.authors)             # article authors
    print("publish_date=", article.publish_date)  # publication date
    # print("top_image=", article.top_image)      # URL of the article's top image
    # print("movies=", article.movies)            # video links in the article
    print("text=", article.text, "\n")            # article body text
    print("summary=", article.summary)            # article summary


if __name__ == "__main__":
    web_lists = ["http://www.sxdi.gov.cn/gzdt/jlsc/", "http://www.people.com.cn/gb/59476/"]
    for web_list in web_lists:
        spider_newspaper_url(web_list)