Python crawlers: distributed crawling

scrapy-redis distributed crawling

Introduction

scrapy-redis uses Redis to implement the request queue and the items queue, and uses a Redis set to deduplicate requests. This lets you scale scrapy from a single machine out to many machines and build a fairly large crawler cluster.
scrapy-redis is a set of Redis-based components for scrapy.
• Distributed crawling
    Multiple spider instances share one Redis request queue, which suits large, multi-domain crawler clusters well.
• Distributed post-processing
    Scraped items are pushed onto a Redis items queue, so any number of item-processing workers can consume them, for example to store the data in MongoDB or MySQL.
• Plug-and-play scrapy components
    Scheduler + duplication filter, item pipeline, base spiders.
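
A quick way to see these shared structures is to inspect the Redis keys that scrapy-redis creates. The sketch below is an illustrative example using the redis-py client, not part of scrapy-redis itself; it assumes Redis runs locally, a spider named fbs_obj (used later in this article), and the default priority queue (which stores pending requests in a sorted set).

import redis

# Minimal inspection script (assumes redis-py is installed and Redis runs on localhost)
r = redis.StrictRedis(host='127.0.0.1', port=6379)

spider = 'fbs_obj'  # spider name used in the example project below
for suffix in ('requests', 'dupefilter', 'items'):
    key = f'{spider}:{suffix}'
    # TYPE shows which Redis structure backs each key
    print(key, r.type(key), sep='\t')

print('pending requests :', r.zcard(f'{spider}:requests'))    # priority queue -> sorted set
print('seen fingerprints:', r.scard(f'{spider}:dupefilter'))  # dedup -> set
print('scraped items    :', r.llen(f'{spider}:items'))        # RedisPipeline output -> list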

scrapy-redis architecture

• Scheduler

The scrapy-redis scheduler relies on the uniqueness of a Redis set to implement the duplication filter (the dupefilter set stores the fingerprints of requests that have already been crawled).
When a spider generates a new request, its fingerprint is checked against the dupefilter set in Redis; only requests that are not duplicates are pushed onto the Redis request queue.
On each cycle the scheduler pops the highest-priority request from the Redis request queue and hands it to the spider.
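
The dedup check itself boils down to a single Redis SADD: it returns 1 when the fingerprint is new and 0 when it has been seen before. The following stripped-down sketch illustrates the idea using scrapy's request_fingerprint helper; it is not scrapy-redis's actual code, and the plain-list queue and hard-coded key names are simplifications (the real scheduler serializes the whole Request and uses a sorted set for priorities).

import redis
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

r = redis.StrictRedis(host='127.0.0.1', port=6379)
DUPEFILTER_KEY = 'fbs_obj:dupefilter'   # set of fingerprints of requests already seen
QUEUE_KEY = 'fbs_obj:requests'          # shared request queue (simplified to a plain list here)

def enqueue_if_new(request):
    """Push the request only if its fingerprint was not already in the dupefilter set."""
    fp = request_fingerprint(request)      # stable hash of method/url/body
    added = r.sadd(DUPEFILTER_KEY, fp)     # 1 -> new fingerprint, 0 -> duplicate
    if added:
        r.lpush(QUEUE_KEY, request.url)    # real scrapy-redis serializes the whole Request
    return bool(added)

print(enqueue_if_new(Request('http://www.example.com/page1')))  # True (first time)
print(enqueue_if_new(Request('http://www.example.com/page1')))  # False (duplicate)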

• Item pipeline

Items scraped by the spider are handed to scrapy-redis's item pipeline, which stores them in the Redis items queue. Items can then be pulled from that queue conveniently, which makes it easy to run a whole cluster of item-processing workers.
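
Because the items end up as JSON strings on the <spider>:items list, post-processing can run as completely separate scripts on any machine that can reach Redis. The worker below is only an illustrative sketch: it assumes redis-py and pymongo with a local mongod, and the key name fbs_obj:items matches the spider used later in this article; swap the storage backend for MySQL or anything else as needed.

import json
import redis
import pymongo

r = redis.StrictRedis(host='127.0.0.1', port=6379)
collection = pymongo.MongoClient('mongodb://127.0.0.1:27017')['fbs']['items']

while True:
    # BLPOP blocks until an item arrives on the fbs_obj:items list
    _key, raw = r.blpop('fbs_obj:items')
    item = json.loads(raw)         # items are stored as JSON strings
    collection.insert_one(item)    # store to MongoDB; replace with MySQL etc. if preferred
    print('saved:', item.get('title'))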

scrapy-redis installation and usage

Installing scrapy-redis

scrapy itself was installed earlier, so here we only need to install scrapy-redis:

pip install scrapy-redis

Adapting the scrapy-redis example project

First get the scrapy-redis example from GitHub, then move the example-project directory to wherever you want it:

git clone https://github.com/rolando/scrapy-redis.git
cp -r scrapy-redis/example-project ./scrapy-youyuan

Alternatively, download the whole project as scrapy-redis-master.zip, unzip it, and copy the example out:

cp -r scrapy-redis-master/example-project/ ./redis-youyuan
cd redis-youyuan/

Use tree to view the project layout.
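
The example project's layout looks roughly like the following; exact file names may differ slightly between scrapy-redis versions, so treat this as a guide rather than an exact listing.

example-project/
├── example/
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       ├── __init__.py
│       ├── dmoz.py
│       ├── mycrawler_redis.py
│       └── myspider_redis.py
└── scrapy.cfg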

Modify settings.py

Note: Chinese comments in settings.py can trigger encoding errors, so use English comments instead.

# Use the scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Keep the scrapy-redis queues in Redis so a crawl can be paused and resumed
SCHEDULER_PERSIST = True

# Queue class used to order pending requests; the default orders by priority
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
# Optional FIFO ordering
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'
# Optional LIFO ordering
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderStack'

# Only effective with SpiderQueue or SpiderStack; maximum idle time before the spider is closed
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Use RedisPipeline to store items in Redis
ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Redis connection parameters
# REDIS_PASS is a password setting I added myself; it required a small change to the
# scrapy-redis source to support connecting to Redis with a password
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# Custom redis client parameters (i.e.: socket timeout, etc.)
REDIS_PARAMS = {}
#REDIS_URL = 'redis://user:pass@hostname:9001'
#REDIS_PARAMS['password'] = 'itcast.cn'
LOG_LEVEL = 'DEBUG'

# The class used to detect and filter duplicate requests; point it at the scrapy-redis
# dupefilter so deduplication is shared across all crawler instances
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# The default (RFPDupeFilter) filters based on request fingerprint using the
# scrapy.utils.request.request_fingerprint function. In order to change the way duplicates
# are checked you could subclass RFPDupeFilter and override its request_fingerprint method.
# This method should accept a scrapy Request object and return its fingerprint (a string).

# By default, RFPDupeFilter only logs the first duplicate request. Setting DUPEFILTER_DEBUG
# to True will make it log all duplicate requests.
DUPEFILTER_DEBUG = True

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate, sdch',
}
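
On the password point: to the best of my understanding, current scrapy-redis versions accept the password through the connection settings, so no source changes are needed. The commented REDIS_URL / REDIS_PARAMS lines above already hint at this; the values below are placeholders.

# Either pass the password as a redis client parameter...
REDIS_PARAMS = {'password': 'your-redis-password'}

# ...or encode everything in a single connection URL (used instead of REDIS_HOST/REDIS_PORT)
# REDIS_URL = 'redis://:your-redis-password@127.0.0.1:6379/0'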

Looking at pipelines.py

from datetime import datetime

class ExamplePipeline(object):
    """Stamps every item with the crawl time and the name of the spider that produced it."""

    def process_item(self, item, spider):
        item["crawled"] = datetime.utcnow()
        item["spider"] = spider.name
        return item
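
For comparison, scrapy_redis.pipelines.RedisPipeline (referenced in ITEM_PIPELINES above) essentially serializes each item and RPUSHes it onto the <spider>:items list. The class below is only a simplified sketch of that behaviour, not the library's actual code; the real pipeline reads its connection settings from the crawler instead of hard-coding them.

import json
import redis

class SimplifiedRedisPipeline(object):
    """Roughly what scrapy_redis.pipelines.RedisPipeline does (illustrative sketch only)."""

    def __init__(self):
        # connection hard-coded for brevity; the real pipeline uses the REDIS_* settings
        self.server = redis.StrictRedis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        data = json.dumps(dict(item))                       # serialize the item to JSON
        self.server.rpush('%s:items' % spider.name, data)   # shared items queue in Redis
        return item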

Workflow

    - Concept: multiple machines can be combined into a distributed cluster that runs the same set of programs and crawls the same group of network resources together.
    - Native scrapy cannot be distributed on its own:
        - the scheduler cannot be shared
        - the pipelines cannot be shared
    - Distribution is achieved with scrapy + redis (the scrapy & scrapy-redis components).
    - What the scrapy-redis component provides:
        - a shareable pipeline and a shareable scheduler
    - Environment setup:
        - pip install scrapy-redis
    - Coding steps:
        1. Create the project
        2. cd proname
        3. Create a CrawlSpider-based spider file
        4. Modify the spider class:
            - import: from scrapy_redis.spiders import RedisCrawlSpider
            - change the spider's parent class to RedisCrawlSpider
            - delete allowed_domains and start_urls
            - add a new attribute: redis_key = 'xxxx', the name of the shared scheduler queue
        5. Modify settings.py
            - configure the pipeline
                ITEM_PIPELINES = {
                        'scrapy_redis.pipelines.RedisPipeline': 400
                    }
            - configure the scheduler
                # Add a dedup container class that uses a Redis set to store request fingerprints,
                # making request deduplication persistent
                DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
                # Use the scheduler shipped with the scrapy-redis component
                SCHEDULER = "scrapy_redis.scheduler.Scheduler"
                # Whether the scheduler should persist, i.e. whether the request queue and the
                # dupefilter set in Redis are kept when the crawl ends. True means persist
                # (do not clear the data); otherwise the data is cleared
                SCHEDULER_PERSIST = True
            - configure the Redis connection
                REDIS_HOST = 'IP address of the Redis server'
                REDIS_PORT = 6379
        6. Configure the Redis server (redis.windows.conf)
            - disable the default binding
                - line 56: #bind 127.0.0.1
            - disable protected mode
                - line 75: protected-mode no
        7. Start the Redis server (with the config file) and a client
            - redis-server.exe redis.windows.conf
            - redis-cli
        8. Run the project
            - scrapy runspider spider.py
        9. Push the start URL into the shared scheduler queue (named sun here)
            - in redis-cli: lpush sun www.xxx.com
        10. In Redis:
            - xxx:items stores the scraped data (see the redis-cli example after this list)
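
After pushing the start URL you can watch progress directly from redis-cli; the key names below assume the spider in the following example, which is named fbs_obj.

lpush sun www.xxx.com          # seed the shared start-URL queue (step 9 above)
keys *                         # fbs_obj:requests / fbs_obj:dupefilter / fbs_obj:items appear as the crawl runs
llen fbs_obj:items             # how many items have been scraped so far
lrange fbs_obj:items 0 1       # peek at the first two stored items (JSON strings)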

Distributed crawling example

The spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from fbs.items import FbsproItem

class FbsSpider(RedisCrawlSpider):
    name = 'fbs_obj'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']
    redis_key = 'sun'  # name of the shared scheduler queue
    link = LinkExtractor(allow=r'type=4&page=\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/@title').extract_first()
            status = tr.xpath('./td[3]/span/text()').extract_first()

            item = FbsproItem()
            item['title'] = title
            item['status'] = status
            yield item

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for fbspro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fbs_obj'

SPIDER_MODULES = ['fbs_obj.spiders']
NEWSPIDER_MODULE = 'fbs_obj.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'fbspro (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 2

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'fbspro.middlewares.FbsproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'fbspro.middlewares.FbsproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'fbspro.pipelines.FbsproPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Configure the pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}
# Configure the scheduler
# Add a dedup container class that uses a Redis set to store request fingerprints,
# making request deduplication persistent
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler shipped with the scrapy-redis component
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Whether the scheduler should persist, i.e. whether the request queue and the dupefilter set
# in Redis are kept when the crawl ends. True means persist (do not clear the data);
# otherwise the data is cleared
SCHEDULER_PERSIST = True

# Configure Redis
REDIS_HOST = '192.168.16.119'
REDIS_PORT = 6379

items.py

import scrapy

class FbsproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    status = scrapy.Field()
