
Using scrapy-splash to simulate clicks on JD.com and scrape data

June 11, 2019 | 移动技术网 IT编程


This is my first blog post. If anything is poorly written, please point it out so we can all improve together!

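Introduction to scrapy-splash

The scrapy-splash module is built mainly on Splash. Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, implemented in Python using Twisted and Qt. The (Twisted) Qt reactor makes the service asynchronous, which lets it take advantage of WebKit's concurrency. Splash's main features are: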

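Process multiple web pages in parallel
Get HTML results and/or render pages to images
Turn off image loading or use Adblock Plus rules to make rendering faster
Process page content with JavaScript
Use Lua scripts
Develop Splash Lua scripts in Splash-Jupyter notebooks
Get detailed rendering information in HAR format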
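
As a rough illustration of the HTTP API mentioned above, the sketch below builds a request URL for Splash's render.html endpoint using only the standard library. The base address http://localhost:8050 and the helper name build_render_url are assumptions made for this example; url, wait, and images are query parameters that render.html accepts.

```python
from urllib.parse import urlencode


def build_render_url(splash_base, target_url, wait=0.5, images=False):
    """Return a render.html URL asking Splash to render target_url.

    splash_base is assumed to be the address of a running Splash
    instance (hypothetical here), e.g. http://localhost:8050.
    """
    query = urlencode({
        "url": target_url,             # page Splash should load
        "wait": wait,                  # seconds to wait after loading
        "images": 1 if images else 0,  # 0 disables image loading
    })
    return f"{splash_base.rstrip('/')}/render.html?{query}"


render_url = build_render_url("http://localhost:8050", "https://search.jd.com")
print(render_url)
```

Fetching that URL with any HTTP client should return the rendered HTML, provided a Splash instance is listening at the assumed address.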

Reference documentation:

Setup

The Scrapy framework
Splash: Windows users install Docker through a virtual machine; Linux users can install Docker directly
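
Before running a spider it can help to confirm that the Splash container is reachable. The sketch below is one possible check, not part of the original project: the helper name splash_is_up and the address http://localhost:8050 are assumptions, and it relies on Splash answering on its _ping endpoint.

```python
from urllib.request import urlopen


def splash_is_up(base_url, timeout=3):
    """Return True if the Splash instance at base_url answers its _ping endpoint."""
    try:
        with urlopen(f"{base_url.rstrip('/')}/_ping", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors
        # (urllib's URLError is a subclass of OSError).
        return False


# Splash is commonly started with: docker run -p 8050:8050 scrapinghub/splash
# The address below is an assumption; use the host and port your Docker setup exposes.
print(splash_is_up("http://localhost:8050"))
```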

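Page analysis

First, go to the site and search for the books you want. This example uses python3.7 books.

After clicking search, you can see that JD loads the book data with JavaScript, and scrolling down loads more books (the data can also be fetched through JD's API).

First, simulate the search. Inspecting the page shows that the search box has the id "keyword" and the search button has the class "input_submit".

Next, simulate scrolling down. The element with the class "bottom-search" at the bottom of the page is used as the scroll target.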

Start crawling

The Lua script that simulates the click and returns the page from which the page count is read:

function main(splash, args)
  -- Skip images to speed up rendering
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(0.5)
  -- Type the keyword into the search box
  local input = splash:select("#keyword")
  input:send_text('python3.7')
  splash:wait(0.5)
  -- Click the search button
  local form = splash:select('.input_submit')
  form:click()
  splash:wait(2)
  -- Scroll to the bottom so the remaining results load
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end

The script that simulates scrolling down is similar:

function main(splash, args)
  -- Skip images to speed up rendering
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(2)
  -- Scroll to the bottom so the remaining results load
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
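
Both scripts hard-code the search keyword. Extra values sent with an execute request are exposed to Lua as fields of args, so the keyword could instead be supplied from Python. The following is a minimal sketch of that idea, not the author's code: the helper build_search_args and the argument name keyword are hypothetical.

```python
# Lua script that reads the keyword from args instead of hard-coding it.
SEARCH_SCRIPT = """
function main(splash, args)
  splash.images_enabled = false
  assert(splash:go(args.url))
  splash:wait(0.5)
  local input = splash:select("#keyword")
  input:send_text(args.keyword)  -- supplied from Python
  splash:wait(0.5)
  splash:select(".input_submit"):click()
  splash:wait(2)
  return splash:html()
end
"""


def build_search_args(keyword, script=SEARCH_SCRIPT):
    """Build the args dict for an 'execute' request (hypothetical helper)."""
    return {"lua_source": script, "keyword": keyword}


args = build_search_args("python3.7")
print(sorted(args))
```

In a spider, the resulting dict would be passed as the args parameter of a SplashRequest that uses endpoint='execute'.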

Pick the elements you want to extract by inspecting the page. The full source:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from ..items import JdSplashItem


# Runs the search, then scrolls to the bottom to load the full first page
lua_script = '''
function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(0.5)
  local input = splash:select("#keyword")
  input:send_text('python3.7')
  splash:wait(0.5)
  local form = splash:select('.input_submit')
  form:click()
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
'''

# Only scrolls to the bottom; used for the follow-up pages
lua_script2 = '''
function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
'''


class JdBookSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['search.jd.com']
    start_urls = ['https://search.jd.com']

    def start_requests(self):
        # Open the search page and run the search
        for each in self.start_urls:
            yield SplashRequest(each, callback=self.parse, endpoint='execute',
                                args={'lua_source': lua_script})

    def parse(self, response):
        # Items on the first results page
        yield from self.s_parse(response)
        # The second-to-last pager link holds the total page count
        page_num = response.xpath("//span[@class='p-num']/a[last()-1]/text()").get()
        if page_num is None:
            return
        # Continue from the second page
        url = 'https://search.jd.com/Search?keyword=python3.7&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&page=%d&s=%d&click=0'
        for each_page in range(1, int(page_num)):
            yield SplashRequest(url % (each_page * 2 + 1, each_page * 60),
                                callback=self.s_parse, endpoint='execute',
                                args={'lua_source': lua_script2})

    def s_parse(self, response):
        price = response.css('div.gl-i-wrap div.p-price i::text').getall()
        # XPath's string() returns the string value of its argument
        # (a number, a boolean, or a node-set), so nested text inside <em> is joined.
        # Perhaps this is where XPath is more refined than CSS.
        name = response.css('div.gl-i-wrap div.p-name').xpath('string(.//em)').getall()
        # Alternative: response.css('div.gl-i-wrap div.p-commit').xpath('string(.//strong)').getall()
        comment = response.css('div.gl-i-wrap div.p-commit strong a::text').getall()
        publishstore = response.css('div.gl-i-wrap div.p-shopnum a::attr(title)').getall()
        href = [response.urljoin(i) for i in response.css('div.gl-i-wrap div.p-img a::attr(href)').getall()]
        for each in zip(name, price, comment, publishstore, href):
            # Create a fresh item per book so yielded items do not share state
            item = JdSplashItem()
            item['name'] = each[0]
            item['price'] = each[1]
            item['comment'] = each[2]
            item['p_store'] = each[3]
            item['href'] = each[4]
            yield item
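
To make the pagination arithmetic in the spider easier to follow, here is a small standalone sketch that lists the follow-up URLs for a given page count. The helper name follow_up_urls is hypothetical; the template and the page and s formulas are taken from the spider. The odd page values suggest that JD counts each visible results page as two half-pages, but that reading is an assumption rather than something the source states.

```python
# URL template used by the spider for every page after the first.
URL_TEMPLATE = (
    "https://search.jd.com/Search?keyword=python3.7&enc=utf-8"
    "&qrst=1&rt=1&stop=1&vt=2&page=%d&s=%d&click=0"
)


def follow_up_urls(page_num):
    """Return the URLs requested after the first results page.

    For loop index n the spider uses page = 2n + 1 and s = 60n.
    """
    return [URL_TEMPLATE % (n * 2 + 1, n * 60) for n in range(1, page_num)]


urls = follow_up_urls(3)
for u in urls:
    print(u)
```

With a page count of 3 this yields two requests, one with page=3 and s=60 and one with page=5 and s=120.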

Configuration of each file:

items.py:

import scrapy


class JdSplashItem(scrapy.Item):
    # Fields collected for each book
    name = scrapy.Field()
    price = scrapy.Field()
    p_store = scrapy.Field()
    comment = scrapy.Field()
    href = scrapy.Field()

settings.py:

# Address of the Splash server
SPLASH_URL = 'http://192.168.99.100:8050'
# Enable the two Splash downloader middlewares and adjust the order of HttpCompressionMiddleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
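
The scrapy-splash README suggests a few more settings alongside the downloader middlewares. They are not part of the original post, so treat the fragment below as an optional addition to settings.py rather than a required step.

```python
# Optional settings recommended by the scrapy-splash README (not in the original post).

# Avoids sending duplicate Splash arguments repeatedly
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Makes request deduplication aware of Splash arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Only relevant if Scrapy's HTTP cache is enabled
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```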

Finally, run the code, and you can see that the book data has been scraped.
