
Scrapy-Redis with POST Requests: A Worked Example

June 4, 2019 | 移动技术网 IT Programming


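Preface

When scraping a small site, Scrapy on its own is usually enough.

For larger sites, however, a single Scrapy process quickly falls short.

It would be ideal if several Scrapy instances could crawl together; many hands make light work.

Unfortunately, Scrapy does not officially support multiple instances crawling the same site at once, although the official documentation does suggest one approach:

**Split the site into several parts and hand each part to a different Scrapy instance**

That works in principle, but it is a hassle, since partitioning a site is tedious.

This is where scrapy-redis, the star of this article, comes in.

Anyone reading this article presumably already knows what Scrapy and scrapy-redis are, so the basics are not covered here. By default scrapy-redis fetches data with GET requests; for cases that require POST requests, you only need to override the make_request_from_data method. Oddly, I could not find a concise answer online; perhaps it is too simple?

This article uses httpbin.org as the example. First add the required configuration to settings.py, adjusting it to your own environment: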

SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # Use the Redis-backed scheduler to store the request queue
SCHEDULER_PERSIST = True  # Keep the Redis queue on exit so the crawl can be paused and resumed
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # Deduplicate requests across all spiders through Redis
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = "redis://127.0.0.1:6379"

The spider code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_redis.spiders import RedisSpider


class HpbSpider(RedisSpider):
    name = 'hpb'
    redis_key = 'test_post_data'

    def make_request_from_data(self, data):
        """Returns a Request instance from data coming from Redis.
        By default, ``data`` is an encoded URL. You can override this method to
        provide your own message decoding.
        Parameters
        ----------
        data : bytes
            Message from redis.
        """
        # Send each Redis message as the "data" form field of a POST request.
        return scrapy.FormRequest("https://www.httpbin.org/post",
                                  formdata={"data": data}, callback=self.parse)

    def parse(self, response):
        print(response.body)

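For simplicity the response is just printed here; in real use you could combine this with a pipeline to write the results to a database.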
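Before handing a response to a pipeline, the JSON body usually needs to be parsed. The following is a minimal sketch, assuming the httpbin-style response shape shown in this article (submitted fields echoed under a "form" key); the helper name extract_form is hypothetical and not part of Scrapy or scrapy-redis.

```python
import json


def extract_form(body: bytes) -> dict:
    """Return the "form" object from an httpbin-style JSON response body.

    Hypothetical helper for illustration; a real site's response layout
    will differ and needs its own parsing.
    """
    payload = json.loads(body.decode("utf-8"))
    # httpbin echoes the submitted form fields under the "form" key.
    return payload.get("form", {})


# Shortened stand-in for the response bodies printed by the spider.
sample_body = b'{"args": {}, "form": {"data": "0"}, "json": null}'
print(extract_form(sample_body))
```

Inside parse, a call such as extract_form(response.body) could produce a dict to yield as an item instead of printing the raw bytes.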

Then start the spider with scrapy crawl hpb. Since nothing has been written to test_post_data yet, the program waits after starting. Next, simulate writing data to the queue:

import redis

rd = redis.Redis('127.0.0.1', port=6379, db=0)
for i in range(1000):
    rd.lpush('test_post_data', i)
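The loop above pushes bare integers. When each task needs several form fields, one option is to push a JSON string and decode it inside make_request_from_data. The sketch below shows only the encoding and decoding step with the standard library; the helper names and the field names keyword and page are assumptions made for illustration.

```python
import json


def encode_payload(fields: dict) -> str:
    """Serialize form fields to a JSON string suitable for pushing to Redis."""
    return json.dumps(fields, ensure_ascii=False)


def decode_payload(data: bytes) -> dict:
    """Turn a raw Redis message back into a str-to-str dict for formdata."""
    fields = json.loads(data.decode("utf-8"))
    # Form values are sent as text, so every key and value is coerced to str.
    return {str(key): str(value) for key, value in fields.items()}


# Redis hands messages back as bytes, which is simulated here with encode().
message = encode_payload({"keyword": "scrapy", "page": 3}).encode("utf-8")
print(decode_payload(message))
```

With this in place, the producer would push encode_payload(...) to the list, and the spider would pass decode_payload(data) as the formdata argument.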

The spider can now be seen fetching data:

2019-05-06 16:30:21 [hpb] DEBUG: Read 8 requests from 'test_post_data'
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "0"\n }, \n "headers": {\n    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "accept-encoding": "gzip,deflate", \n    "accept-language": "en", \n    "content-length": "6", \n    "content-type": "application/x-www-form-urlencoded", \n    "host": "", \n    "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "1"\n }, \n "headers": {\n    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "accept-encoding": "gzip,deflate", \n    "accept-language": "en", \n    "content-length": "6", \n    "content-type": "application/x-www-form-urlencoded", \n    "host": "", \n    "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "3"\n }, \n "headers": {\n    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "accept-encoding": "gzip,deflate", \n    "accept-language": "en", \n    "content-length": "6", \n    "content-type": "application/x-www-form-urlencoded", \n    "host": "", \n    "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "2"\n }, \n "headers": {\n    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "accept-encoding": "gzip,deflate", \n    "accept-language": "en", \n    "content-length": "6", \n    "content-type": "application/x-www-form-urlencoded", \n    "host": "", \n    "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "4"\n }, \n "headers": {\n    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "accept-encoding": "gzip,deflate", \n    "accept-language": "en", \n    "content-length": "6", \n    "content-type": "application/x-www-form-urlencoded", \n    "host": "", \n    "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "5"\n }, \n "headers": {\n    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "accept-encoding": "gzip,deflate", \n    "accept-language": "en", \n    "content-length": "6", \n    "content-type": "application/x-www-form-urlencoded", \n    "host": "", \n    "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "6"\n }, \n "headers": {\n    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "accept-encoding": "gzip,deflate", \n    "accept-language": "en", \n    "content-length": "6", \n    "content-type": "application/x-www-form-urlencoded", \n    "host": "", \n    "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "7"\n }, \n "headers": {\n    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "accept-encoding": "gzip,deflate", \n    "accept-language": "en", \n    "content-length": "6", \n    "content-type": "application/x-www-form-urlencoded", \n    "host": "", \n    "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
2019-05-06 16:31:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 280 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:32:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:33:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

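As for duplicate data: if the POST data is a repeat, the request is not sent. In the special case where posting the same data returns different results, the author found that adding dont_filter=True does not help, because the RFPDupeFilter class does not take this parameter into account; the filter has to be overridden.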
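To see why repeated POST data is dropped, it helps to recall that the duplicate filter compares request fingerprints, which are hashes derived from the method, URL, and body. The sketch below is a simplified stand-in written with the standard library, not Scrapy's actual fingerprint code; the function name simple_fingerprint and the nonce field are hypothetical.

```python
import hashlib
from urllib.parse import urlencode


def simple_fingerprint(method: str, url: str, formdata: dict) -> str:
    """Hash method, URL, and form body (simplified illustration only)."""
    body = urlencode(sorted(formdata.items())).encode("utf-8")
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(url.encode("utf-8"))
    digest.update(body)
    return digest.hexdigest()


url = "https://www.httpbin.org/post"
first = simple_fingerprint("POST", url, {"data": "1"})
repeat = simple_fingerprint("POST", url, {"data": "1"})
# A hypothetical extra field changes the body and therefore the hash.
tagged = simple_fingerprint("POST", url, {"data": "1", "nonce": "a1"})

print(first == repeat)
print(first == tagged)
```

Identical form data yields an identical hash, so the second request looks already seen. Adding a varying field is one way around this without touching the filter, but only if the target server tolerates the extra parameter; otherwise overriding the filter, as described above, is the remaining option.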

Summary

That is all for this article. I hope it is a useful reference for your study or work. Thank you for supporting 移动技术网.
