Asynchronous crawling with PhantomJS in Scrapy

January 8, 2019 | 移动技术网 IT编程

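Selenium makes it easy to retrieve the Ajax-loaded content of a web page, and it can simulate user actions such as clicking and typing text. Both are very useful when crawling pages with Scrapy.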

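Many articles online show how to integrate Selenium into Scrapy, but few of them keep the crawl asynchronous. The code below replaces Scrapy's downloader with a custom download handler that integrates Selenium and remains asynchronous.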

To use it, register PhantomJSDownloadHandler in the downloader section of the project's settings file (the DOWNLOAD_HANDLERS setting).
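A minimal sketch of what that registration could look like. The module path `myproject.handlers` is a hypothetical location for the class, and the option values are illustrative only. The setting names are written in uppercase on the assumption that the handler reads `PHANTOMJS_OPTIONS` and `PHANTOMJS_MAXRUN`, since Scrapy only loads uppercase names from a settings module.

```python
# settings.py (sketch). "myproject.handlers" is a hypothetical module path;
# point it at wherever PhantomJSDownloadHandler actually lives.
DOWNLOAD_HANDLERS = {
    "http": "myproject.handlers.PhantomJSDownloadHandler",
    "https": "myproject.handlers.PhantomJSDownloadHandler",
}

# Keyword arguments forwarded to webdriver.PhantomJS(**options).
# The value below is illustrative, not a recommendation.
PHANTOMJS_OPTIONS = {
    "service_args": ["--load-images=false"],
}

# Upper bound on PhantomJS drivers in use at once (the handler falls back to 10).
PHANTOMJS_MAXRUN = 5
```

Both schemes are mapped here so that every HTTP and HTTPS request goes through PhantomJS. Mapping only one of them would leave the other on Scrapy's default handler.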

# encoding: utf-8
from __future__ import unicode_literals

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
# Newer Scrapy versions no longer ship scrapy.xlib; there, use
# "from pydispatch import dispatcher" instead.
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure


class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        # The semaphore caps concurrent downloads; the LIFO queue holds idle drivers.
        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """Use a semaphore to guard the PhantomJS driver pool."""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        # Reuse an idle driver if one is available, otherwise start a new one.
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver does not respond to a window switch until the page has
        # loaded, so wait for it in a worker thread instead of the reactor.
        dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        # Selenium cannot access response headers; an empty <head> is taken
        # to mean non-HTML content, so fall back to the plain text.
        if body.startswith("<head></head>"):
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        # Optional spider hook: a truthy result discards the driver and fails the download.
        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            # Failure() with no arguments needs an active exception, so wrap one explicitly.
            return defer.fail(Failure(Exception("response_failed rejected %s" % url)))
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        # Close every idle driver when the spider closes.
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()
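
The handler above looks up an optional `response_failed` attribute on the spider and calls it with the response and the driver. A truthy return value makes the handler close that driver and fail the download. The following is a sketch of such a hook. The mixin name `CaptchaAwareMixin` and the "captcha" marker text are assumptions made for illustration and are not part of the original code.

```python
class CaptchaAwareMixin(object):
    """Hypothetical mixin that supplies the optional response_failed hook."""

    # Assumed marker; adjust to whatever a blocked page looks like on the target site.
    blocked_marker = "captcha"

    def response_failed(self, response, driver):
        """Return True when the page looks blocked, so the handler drops the driver."""
        body = response.body
        if isinstance(body, bytes):
            body = body.decode("utf-8", "replace")
        return self.blocked_marker in body.lower()
```

A spider could pick this up by listing the mixin before `scrapy.Spider` in its base classes. The `driver` argument goes unused here, but a hook could inspect it (cookies, current URL) before deciding.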
