
Crawling All News and Comments on the Tencent News Home Page with Python + Scrapy

April 26, 2018 | 移动技术网 IT编程


Preface

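This post describes a crawler that collects every news article on the Tencent News home page together with all of its comments. It is built on Python's Scrapy framework. The main topic is how to use the Chrome developer tools to find the addresses that the news and the comments actually come from.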

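The Chrome developer tools (or the Firefox web console) are very useful: they show exactly what the browser sends and what it receives while you visit a site. When writing a crawler, we need to know where the content we want comes from, that is, which link returns it.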

Main text

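News links on the Tencent News home page come in three formats.

The first is: https://news.qq.com/a/time/newsID.htm

For example: https://news.qq.com/a/20180414/010445.htm

The second is:

For example:

The third is:

For example:

Where:
time: the publication date of the article. The third link format does not have this value.
newsID: the ID of the news page. In the first format the ID contains only digits; in the other two it contains digits and letters.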
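
The three formats can be told apart with regular expressions. The sketch below reuses the patterns from the spider shown later in this post (`url_pattern1`, `url_pattern4`, `url_pattern3`); the function name `classify_news_url` and the labels `format1` to `format3` are illustrative, not part of the original code. For the third format it reads the date from the first eight digits of the ID, as the spider does.

```python
import re

# Patterns copied from the spider later in this post; the labels are illustrative.
URL_PATTERNS = [
    ("format1", re.compile(r"(.*)/a/(\d{8})/(\d+)\.htm")),
    ("format2", re.compile(r"(.*)/omn/(\d{8})/(.+)\.html")),
    ("format3", re.compile(r"(.*)/omn/([A-Z0-9]{16,19})")),
]


def classify_news_url(url):
    """Return (label, date, news_id) for a recognised link, or None."""
    for label, pattern in URL_PATTERNS:
        m = pattern.match(url)
        if m is None:
            continue
        if label == "format3":
            # No date segment in the path; look for eight digits inside the ID.
            news_id = m.group(2)
            date = re.search(r"\d{8}", news_id)
            return label, (date.group(0) if date else None), news_id
        return label, m.group(2), m.group(3)
    return None


print(classify_news_url("https://news.qq.com/a/20180414/010445.htm"))
print(classify_news_url("http://new.qq.com/omn/20180414A000MX00"))
```

The order of the list matters: the more specific patterns are tried first, so a link with a date directory is not mistaken for the third format.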

Links in all three formats can be found in the source of the Tencent News home page, as shown in the figure:

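Once we have the news pages, the next step is to get the article body. For the first two formats, the body and the other information can be read directly from the page source. The third format is more troublesome and is covered below. We also need to get from each news page to its comment page.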

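The comment-page link has the same format for all three kinds of news:
http://coral.qq.com/cmtid
For example: http://coral.qq.com/2572597712
Here cmtid is a string of digits that identifies the comment page of one article. We need to find this value in the news page. For the first two formats this is easy, because the cmtid and the other article information are in the page source. The third format is different: its page source does not contain what we need.
This is where the developer tools come in: we use them to find the cmtid and the article body for the third format.
On a third-format news page, http://new.qq.com/omn/newsid, press F12 (or right-click -> Inspect) to open the developer tools, click Network, and press F5 to refresh, as shown in the figure: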

Then find the request that contains the information we need, as shown in the figure:

Look up its address in the Headers tab, as shown in the figure:

This shows that the cmtid and the body of a third-format article are returned by this address:
http://openapi.inews.qq.com/getQQNewsNormalContent?id=newsid&chlid=news_rss&refer=mobilewwwqqcom&otype=jsonp&ext_data=all&srcfrom=newsapp&callback=getNewsContentOnlyOutput
For example: http://openapi.inews.qq.com/getQQNewsNormalContent?id=20180414A000MX00&chlid=news_rss&refer=mobilewwwqqcom&otype=jsonp&ext_data=all&srcfrom=newsapp&callback=getNewsContentOnlyOutput

Here newsid is the ID of the article. From this link we can get the cmtid and the body.
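
Because the address is requested with `otype=jsonp`, the JSON payload comes back wrapped in a `getNewsContentOnlyOutput(...)` call and has to be unwrapped before parsing. The following is a minimal sketch of that step, not the author's exact code. The helper names `build_content_url` and `unwrap_jsonp` are hypothetical, and the sample response is made up; only its field names (`cid`, `title`, `ext_data`, `cnt_html`) are taken from the spider shown later in this post.

```python
import json
import re

API = ("http://openapi.inews.qq.com/getQQNewsNormalContent?id={news_id}"
       "&chlid=news_rss&refer=mobilewwwqqcom&otype=jsonp&ext_data=all"
       "&srcfrom=newsapp&callback=getNewsContentOnlyOutput")


def build_content_url(news_id):
    """Fill the news ID into the content address found with the developer tools."""
    return API.format(news_id=news_id)


def unwrap_jsonp(text, callback="getNewsContentOnlyOutput"):
    """Strip the callback(...) wrapper and parse the JSON inside it."""
    m = re.search(re.escape(callback) + r"\((.+)\)", text, re.S)
    if m is None:
        raise ValueError("response is not wrapped in %s(...)" % callback)
    return json.loads(m.group(1))


# Made-up sample response; a real one would come from build_content_url(...).
sample = ('getNewsContentOnlyOutput({"cid": "2572597712", "title": "demo", '
          '"ext_data": {"cnt_html": "<p>body</p>"}})')
data = unwrap_jsonp(sample)
print(build_content_url("20180414A000MX00"))
print(data["cid"], data["title"], data["ext_data"]["cnt_html"])
```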

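With the cmtid in hand, the next step is to find where the comment data comes from.
Open the developer tools on the comment page http://coral.qq.com/cmtid and refresh to see the responses, as shown in the figure: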

Look up the address in the Headers tab, as shown in the figure:

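This shows that the comments are returned by the following address:
http://coral.qq.com/article/2530433473/comment/v2?callback=_articlecommentv2&orinum=10&oriorder=t&pageflag=1&cursor=0&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=1&_=1522383466213
Where:
2530433473: the comment-page ID.
orinum=10: return 10 comments. The maximum is 30, so one page returns at most 30 comments.
oriorder=t: sort the returned comments by time; o sorts them by popularity.
orirepnum=2: return at most 2 replies for each comment, that is, at most two nested replies.
cursor=0: the starting value is 0; after that, the value of last in the returned page gives the next page of comments.
reporder=o: the sort order for replies; it takes the same values as oriorder.

The values above can be changed to suit your needs; the rest can be left as they are. To get all comments, keep updating cursor with the value of last from each returned page.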
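
The cursor-based paging described above can be sketched as follows. This is an illustration under stated assumptions, written for Python 3: the names `build_comment_url` and `iter_comment_pages` are hypothetical, the `fetch` argument stands in for whatever function downloads a URL and returns its text, and the two canned responses are invented so the example runs without network access. The `data`, `last`, and `oriCommList` keys are the ones used by the comment crawler later in this post, and an empty `last` is assumed to mark the final page, as in that code.

```python
import json
import re
from urllib.parse import urlencode


def build_comment_url(cmtid, cursor="0", orinum=30):
    """Build the comment address for one page, starting at the given cursor."""
    params = [
        ("callback", "_articlecommentv2"),
        ("orinum", orinum),    # comments per page, at most 30
        ("oriorder", "t"),     # t: by time, o: by popularity
        ("pageflag", 1),
        ("cursor", cursor),
        ("orirepnum", 2),      # replies returned per comment
    ]
    return "http://coral.qq.com/article/%s/comment/v2?%s" % (cmtid, urlencode(params))


def iter_comment_pages(cmtid, fetch):
    """Yield the comment list of each page, following `last` as the next cursor."""
    cursor = "0"
    while True:
        text = fetch(build_comment_url(cmtid, cursor))
        m = re.search(r"_articlecommentv2\((.+)\)", text, re.S)
        if m is None:
            break
        data = json.loads(m.group(1))["data"]
        yield data["oriCommList"]
        if not data["last"]:
            break
        cursor = data["last"]


# Invented responses keyed by cursor, used in place of real HTTP requests.
PAGES = {
    "0": '_articlecommentv2({"data": {"last": "101", "oriCommList": [{"content": "a"}]}})',
    "101": '_articlecommentv2({"data": {"last": "", "oriCommList": [{"content": "b"}]}})',
}


def fake_fetch(url):
    return PAGES[re.search(r"cursor=(\w*)", url).group(1)]


for page in iter_comment_pages("2530433473", fake_fetch):
    print([c["content"] for c in page])
```

Passing the download function in as a parameter keeps the paging logic separate from the HTTP details, so it can be tried out offline as above and later given a real fetcher.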
That covers how to find the data sources. Next comes the crawler itself.

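Implementing the crawler
The crawler has two modules. Module 1 crawls all the news on the home page and collects each article's body, news ID, comment-page ID, and other information.
Module 2 uses the collected news IDs and comment-page IDs to crawl all the comments of each article, one article at a time.
Main code of module 1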

# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from test1.items import NewsItem, ListCombiner
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import requests
import re
import json

class TencentNewsSpider(CrawlSpider):
    name = 'tencent_news_spider'
    allowed_domains = ['new.qq.com', 'news.qq.com']
    start_urls = [
        'http://news.qq.com'
    ]
    # Format 1: /a/<date>/<numeric id>.htm
    url_pattern1 = r'(.*)/a/(\d{8})/(\d+)\.htm'
    # Format 2: pattern2 selects the links, pattern4 splits out date and ID
    url_pattern2 = r'(.*)/omn/(.+)\.html'
    # Format 3: /omn/<id made of uppercase letters and digits>
    url_pattern3 = r'(.*)/omn/([A-Z0-9]{16,19})'
    url_pattern4 = r'(.*)/omn/(\d{8})/(.+)\.html'
    # If several rules match the same link, the first one defined is used
    rules = (
        Rule(LinkExtractor(allow=(url_pattern1,)), callback='parse_news1'),
        Rule(LinkExtractor(allow=(url_pattern2,)), callback='parse_news2'),
        Rule(LinkExtractor(allow=(url_pattern3,)), callback='parse_news3'),
    )

    def parse_news1(self, response):
        sel = Selector(response)
        print(response.url)
        pattern = re.match(self.url_pattern1, str(response.url))
        cmt_ids = sel.re(r"cmt_id = (.*);")
        if pattern is None or not cmt_ids:
            # Some pages carry no cmt_id (the page may be empty); skip them
            return None
        item = NewsItem()
        item['source'] = 'tencent'
        item['date'] = pattern.group(2)
        item['newsId'] = pattern.group(3)
        item['cmtId'] = cmt_ids[0]
        item['comments'] = {'link': 'http://coral.qq.com/' + item['cmtId']}
        item['contents'] = {'link': str(response.url), 'title': u'', 'passage': u''}
        # extract_first avoids an IndexError when the page has no <h1>
        item['contents']['title'] = sel.xpath('//h1/text()').extract_first(default=u'')
        item['contents']['passage'] = ListCombiner(sel.xpath('//p/text()').extract())
        return item


    def parse_news2(self, response):
        sel = Selector(response)
        pattern = re.match(self.url_pattern4, str(response.url))
        cmt_ids = sel.re(r"\"comment_id\":\"(\d*)\",")
        if pattern is None or not cmt_ids:
            # The link has no date directory, or the page has no comment_id
            return None
        item = NewsItem()
        item['source'] = 'tencent'
        item['date'] = pattern.group(2)
        item['newsId'] = pattern.group(3)
        item['cmtId'] = cmt_ids[0]
        item['comments'] = {'link': 'http://coral.qq.com/' + item['cmtId']}
        item['contents'] = {'link': str(response.url), 'title': u'', 'passage': u''}
        item['contents']['title'] = sel.xpath('//h1/text()').extract_first(default=u'')
        item['contents']['passage'] = ListCombiner(sel.xpath('//p/text()').extract())
        return item

    def parse_news3(self, response):
        print(response.url)
        str1 = 'http://openapi.inews.qq.com/getQQNewsNormalContent?id='
        str2 = '&chlid=news_rss&refer=mobilewwwqqcom&otype=jsonp&ext_data=all&srcfrom=newsapp&callback=getNewsContentOnlyOutput'
        pattern = re.match(self.url_pattern3, str(response.url))
        if pattern is None:
            return None
        news_id = pattern.group(2)
        date = re.search(r"(\d{8})", news_id)  # the date is embedded in the ID
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
        }
        out = self.getHTMLText(str1 + news_id + str2, headers)
        # The response is JSONP: getNewsContentOnlyOutput({...})
        g = re.search(r"getNewsContentOnlyOutput\((.+)\)", out or '')
        if g is None:
            return None
        out = json.loads(g.group(1))
        item = NewsItem()
        item['source'] = 'tencent'
        item['date'] = date.group(0) if date else u''
        item['newsId'] = news_id
        item['cmtId'] = out["cid"]
        item['comments'] = {'link': 'http://coral.qq.com/' + item['cmtId']}
        item['contents'] = {'link': str(response.url), 'title': u'', 'passage': u''}
        item['contents']['title'] = out["title"]
        item['contents']['passage'] = out["ext_data"]["cnt_html"]
        return item

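    def getHTMLText(self, url, headers):
        # Download a URL with requests; return its text, or None on failure
        try:
            r = requests.get(url, headers=headers, timeout=10)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException as e:
            print("Request failed: %s" % e)
            return None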
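
The spider imports `ListCombiner` from `test1.items`, but that file is not shown in this post. Judging only from how it is called (it receives the list returned by `xpath('//p/text()').extract()` and its result is stored as the passage), one plausible version simply joins the text fragments. The sketch below is a guess at its behaviour, not the author's implementation, which is in the GitHub repository linked at the end.

```python
def ListCombiner(fragments):
    """Join the text fragments of a page into one string.

    Hypothetical stand-in for the helper imported from test1.items:
    it strips each fragment and drops the ones that are only whitespace.
    """
    cleaned = (part.strip() for part in fragments)
    return "".join(part for part in cleaned if part)


print(ListCombiner([" first paragraph ", "\n", "second paragraph"]))
```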

Main code of module 2
The function below crawls the comments:

# -*- coding: utf-8 -*-
import requests
import re
import json
import codecs
import os
import datetime

# Crawl all comments of the article whose comment-page ID is commentid,
# whose date is date, and whose news ID is newsID
def crawlcomment(commentid, date, newsID):
    url1 = 'http://coral.qq.com/article/' + commentid + '/comment/v2?callback=_articlecommentv2&orinum=30&oriorder=t&pageflag=1&cursor='
    url2 = '&orirepnum=10&_=1522383466213'
    # The header is required; without it the request is refused
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
    }

    # Make sure the output directory exists before opening the file
    out_dir = os.path.join(os.getcwd(), 'docs', 'tencent', date)
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    comments_file_path = os.path.join(out_dir, newsID + '_comments.json')

    cursor = '0'
    with codecs.open(comments_file_path, 'a', 'utf-8') as news_file:
        while True:
            response = getHTMLText(url1 + cursor + url2, headers)
            # The response is JSONP: _articlecommentv2({...})
            g = re.search(r"_articlecommentv2\((.+)\)", response)
            if g is None:
                break
            data = json.loads(g.group(1))["data"]
            for i in data["oriCommList"]:
                # Convert the Unix timestamp to a readable time
                posted = str(datetime.datetime.fromtimestamp(int(i["time"])))
                line = json.dumps(posted + ':' + i["content"], ensure_ascii=False) + '\n'
                news_file.write(line)
            if not data["last"]:
                break
            cursor = data["last"]  # cursor of the next comment page
    print("finish!")

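def getHTMLText(url, headers):
    # Download a URL; return its text, or an empty string on failure
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        print("Request failed: %s" % e)
        return ""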

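Screenshots of the results

132 news articles were crawled, as shown in the figure:

News body, as shown in the figure:

News comments, as shown in the figure:

All the code is available on my GitHub: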

https://github.com/Hahallo/CrawlTencentNewsComments
