Scraping the 博客园 (cnblogs) Featured Section with Scrapy

December 1, 2018 | 移动技术网 IT编程

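Crawl target

For every article in the 博客园 (cnblogs) featured section, collect the title, title link, author, link to the author's blog homepage, summary, publish time, comment count, view count, and recommendation count, and store them in MongoDB.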

Environment

Scrapy is installed
MongoDB is installed, along with the pymongo driver that the pipeline imports

Create the project

scrapy startproject cnblogs

Running this command at the command prompt creates a folder named cnblogs.

Create the spider file

cd cnblogs
scrapy genspider cn cnblogs.com

These commands create a spider file named cn.py under cnblogs\spiders\; cnblogs.com is the domain the spider is allowed to crawl.

Write items.py

Define the fields to scrape.

import scrapy


class cnblogsitem(scrapy.Item):
    # One field per value scraped for each article
    post_author = scrapy.Field()    # author
    author_link = scrapy.Field()    # link to the author's blog homepage
    post_date = scrapy.Field()      # publish time
    digg_num = scrapy.Field()       # recommendation count
    title = scrapy.Field()          # title
    title_link = scrapy.Field()     # title link
    item_summary = scrapy.Field()   # summary
    comment_num = scrapy.Field()    # comment count
    view_num = scrapy.Field()       # view count

Write the spider file cn.py

import scrapy
from cnblogs.items import cnblogsitem


class cnspider(scrapy.Spider):
    name = 'cn'
    allowed_domains = ['cnblogs.com']
    start_urls = ['https://www.cnblogs.com/pick/']

    def parse(self, response):
        # Each child div of #post_list holds one article entry
        div_list = response.xpath("//div[@id='post_list']/div")
        for div in div_list:
            item = cnblogsitem()
            item["post_author"] = div.xpath(".//div[@class='post_item_foot']/a/text()").extract_first()
            item["author_link"] = div.xpath(".//div[@class='post_item_foot']/a/@href").extract_first()
            # extract() returns a list of text nodes; the pipeline picks the non-empty one
            item["post_date"] = div.xpath(".//div[@class='post_item_foot']/text()").extract()
            item["comment_num"] = div.xpath(".//span[@class='article_comment']/a/text()").extract_first()
            item["view_num"] = div.xpath(".//span[@class='article_view']/a/text()").extract_first()
            item["title"] = div.xpath(".//h3/a/text()").extract_first()
            item["title_link"] = div.xpath(".//h3/a/@href").extract_first()
            item["item_summary"] = div.xpath(".//p[@class='post_item_summary']/text()").extract()
            item["digg_num"] = div.xpath(".//span[@class='diggnum']/text()").extract_first()
            yield item

        # Follow the pager's "Next >" link; it is absent on the last page
        next_url = response.xpath(".//a[text()='Next >']/@href").extract_first()
        if next_url is not None:
            next_url = "https://www.cnblogs.com" + next_url
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )
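
The spider builds the next-page URL by prepending the site root to the extracted href. As a hedged alternative, the standard library's urljoin resolves a relative href against the page URL and leaves an already absolute href unchanged; inside a spider, response.urljoin(next_url) does the same against response.url. The helper name absolute_next_url and the sample href values below are assumptions made for illustration only.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the spider's start URL)
page_url = "https://www.cnblogs.com/pick/"


def absolute_next_url(base_url, href):
    """Return an absolute URL for a pager href, or None when there is no next page."""
    if href is None:
        return None
    return urljoin(base_url, href)


# Hypothetical href values, not taken from the live site
print(absolute_next_url(page_url, "/pick/2/"))
print(absolute_next_url(page_url, "https://www.cnblogs.com/pick/3/"))
print(absolute_next_url(page_url, None))
```

Resolving against the page URL keeps working if the site ever emits absolute or page-relative links, which plain string concatenation would not handle.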

Write pipelines.py

Lightly clean the scraped data by removing whitespace and empty strings, then save it to MongoDB.

from pymongo import MongoClient
import re

# Connects to the default local MongoDB instance
client = MongoClient()
collection = client["test"]["cnblogs"]


class cnblogspipeline(object):
    def process_item(self, item, spider):
        item["post_date"] = self.process_string_list(item["post_date"])
        item["comment_num"] = self.process_string(item["comment_num"])
        item["item_summary"] = self.process_string_list(item["item_summary"])
        print(item)
        # insert_one replaces Collection.insert, which PyMongo 4 removed
        collection.insert_one(dict(item))
        return item

    def process_string(self, content_string):
        # Remove every whitespace character; None passes through unchanged
        if content_string is not None:
            content_string = re.sub(r"\s+", "", content_string)
        return content_string

    def process_string_list(self, string_list):
        # Return the first non-empty cleaned string, or None if there is none
        if not string_list:
            return None
        cleaned = [re.sub(r"\s+", "", i) for i in string_list]
        cleaned = [i for i in cleaned if len(i) > 0]
        return cleaned[0] if cleaned else None
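
The two cleaning helpers in the pipeline can be tried outside Scrapy. The following is a minimal sketch written as free functions; clean_string and first_non_empty are illustrative names that are not part of the original project, and the sample strings are made up. It also shows a list with no usable text yielding None instead of raising an error.

```python
import re


def clean_string(text):
    """Remove every whitespace character from text; pass None through."""
    if text is None:
        return None
    return re.sub(r"\s+", "", text)


def first_non_empty(strings):
    """Clean each string and return the first non-empty result, or None."""
    for text in strings or []:
        cleaned = clean_string(text)
        if cleaned:
            return cleaned
    return None


# Made-up samples shaped like what .extract() returns: a list of text nodes
raw_summary = ["\r\n    ", "  摘要 内容  \r\n"]
print(first_non_empty(raw_summary))
print(first_non_empty(["\n", "   "]))
print(clean_string("  12 \n"))
```

Because every whitespace character is removed, spaces inside the text disappear as well. That is harmless for Chinese summaries but would join words in space-separated text.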

Modify settings.py

Add USER_AGENT

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'

Enable the pipeline

ITEM_PIPELINES = {
    'cnblogs.pipelines.cnblogspipeline': 300,
}
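
The pipeline enabled here creates its MongoDB client as soon as pipelines.py is imported. A hedged alternative is to open and close the connection in Scrapy's open_spider and close_spider pipeline hooks. In the sketch below, the class name, the constructor parameters, and the default URI are assumptions for illustration; the database and collection names follow the original code.

```python
class MongoLifecyclePipeline:
    """Sketch of a pipeline that owns its MongoDB connection (hypothetical name)."""

    def __init__(self, uri="mongodb://localhost:27017", db_name="test", coll_name="cnblogs"):
        self.uri = uri
        self.db_name = db_name
        self.coll_name = coll_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Called once when the spider starts; imported here so the class
        # can be loaded even where pymongo is not installed
        from pymongo import MongoClient

        self.client = MongoClient(self.uri)
        self.collection = self.client[self.db_name][self.coll_name]

    def close_spider(self, spider):
        # Called once when the spider finishes
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Store a plain-dict copy of the item and pass the item along
        self.collection.insert_one(dict(item))
        return item
```

To try it, the class would be placed in pipelines.py and registered in the pipeline setting in place of the original entry.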

Run the program

Run the following command to start the crawl.

scrapy crawl cn

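Results

When the program finishes, the scraped data is stored in MongoDB. The original post showed it in a screenshot taken with the GUI tool Robo 3T.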

Thank you for reading. If anything in this article is incorrect, please point it out and I will learn from it and correct it.
