
Lagou Crawler

June 3, 2019 | 移动技术网 (IT Programming)


#!/usr/bin/env python
# -*- coding:utf-8 -*-
import json
import re
import time

import lxml.html
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Local helper module (not shown in this post): a dict-like wrapper around Redis
from redis_cache import RedisCache


class LagouSpider(object):

    def __init__(self):
        # Call webdriver.Chrome() to launch the browser
        self.driver = webdriver.Chrome()
        self.url = 'https://www.lagou.com/jobs/list_python?labelwords=&fromsearch=true&suginput='
        self.detail_url = None

    def run(self):
        # Open the search-results URL in the browser
        self.driver.get(self.url)
        while True:
            # Wait for the page to load; continue once the pager's last button is present
            WebDriverWait(driver=self.driver, timeout=10).until(
                EC.presence_of_element_located((By.XPATH, '//div[@class="pager_container"]/span[last()]'))
            )
            # Read the page source only after the wait has succeeded
            source = self.driver.page_source
            # Pass the source to parse_list_page for parsing
            self.parse_list_page(source)
            try:
                next_btn = self.driver.find_element(By.XPATH, '//div[@class="pager_container"]/span[last()]')
                if "pager_next_disabled" in next_btn.get_attribute("class"):
                    break
                else:
                    next_btn.click()
            except Exception:
                # Dump the page for debugging and stop, rather than re-parsing the same page forever
                print(source)
                break
            time.sleep(1)

    def parse_list_page(self, source):
        """
        Parse the search-results (list) page.
        :param source: HTML source of the list page
        :return: None
        """
        html = lxml.html.fromstring(source)
        # Collect the detail-page links
        links = html.xpath('//a[@class="position_link"]/@href')
        for link in links:
            self.detail_url = link
            self.requests_detail_page(link)
            time.sleep(1)

    def requests_detail_page(self, url):
        """
        Open a detail page in a new tab, parse it, then return to the list tab.
        :param url: URL of the detail page
        :return: None
        """
        self.driver.execute_script("window.open('%s')" % url)
        self.driver.switch_to.window(self.driver.window_handles[1])
        WebDriverWait(self.driver, timeout=10).until(
            EC.presence_of_element_located((By.XPATH, '//div[@class="job-name"]//span[@class="name"]'))
        )
        source = self.driver.page_source
        self.parse_detail_page(source)
        self.driver.close()
        self.driver.switch_to.window(self.driver.window_handles[0])

    def parse_detail_page(self, source):
        """Parse a job detail page and store the result in Redis."""
        html = lxml.html.fromstring(source)

        job_name = html.xpath('//div[@class="job-name"]//span[@class="name"]/text()')[0]
        job_salary = html.xpath('//dd[@class="job_request"]/p//span[1]/text()')[0]
        job_city = html.xpath('//dd[@class="job_request"]/p//span[2]/text()')[0]
        # Strip whitespace and the "/" separators from the field
        job_city = re.sub(r"[\s/]", "", job_city)
        experience = html.xpath('//dd[@class="job_request"]/p//span[3]/text()')[0].strip()
        experience = re.sub(r"[\s/]", "", experience)
        education = html.xpath('//dd[@class="job_request"]/p//span[4]/text()')[0]
        education = re.sub(r"[\s/]", "", education)
        job_time = html.xpath('//dd[@class="job_request"]/p//span[5]/text()')[0]
        job_advantage = html.xpath('//dd[@class="job-advantage"]/p/text()')[0]
        desc = "".join(html.xpath('//dd[@class="job_bt"]//text()')).strip()
        job_address = "".join(html.xpath('//div[@class="work_addr"]//text()'))
        # Drop the last four characters (the trailing link text after the address)
        job_address = re.sub(r"[\s/]", "", job_address)[0:-4]

        position = {
            'job_name': job_name,
            'job_salary': job_salary,
            'job_city': job_city,
            'experience': experience,
            'education': education,
            'job_advantage': job_advantage,
            'desc': desc,
            'job_address': job_address,
            'job_time': job_time,
        }

        # Store the record keyed by the detail-page URL, then read it back to verify
        rc = RedisCache()
        rc[self.detail_url] = position
        position_print = json.loads(rc[self.detail_url])
        print(self.detail_url)
        print(position_print)
        print('='*40)


if __name__ == '__main__':
    spider = LagouSpider()
    try:
        spider.run()
    finally:
        # Close the browser even if the crawl fails
        spider.driver.quit()
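
The script imports its cache class from a local `redis_cache` module that the post does not show. The following is only a minimal sketch of what such a module might look like. It assumes what the spider's usage suggests: item assignment stores a dict, and item lookup returns a JSON string that the caller passes to `json.loads`. The `client` parameter and the `DictBackend` stand-in are hypothetical additions for illustration, so the class can be tried without a running Redis server. The host, port, and db values are likewise assumptions.

```python
import json


class DictBackend:
    """Hypothetical in-memory stand-in exposing the two client methods used below."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
        return True


class RedisCache:
    """Dict-like cache that stores each record as a JSON string under its URL."""

    def __init__(self, client=None):
        if client is None:
            # Imported lazily so the class can be used with a stand-in backend.
            import redis  # third-party package: redis-py
            # Assumed connection settings for a local Redis server.
            client = redis.Redis(host='localhost', port=6379, db=0)
        self.client = client

    def __setitem__(self, url, record):
        # Serialize the dict to JSON; ensure_ascii=False keeps Chinese text readable.
        self.client.set(url, json.dumps(record, ensure_ascii=False))

    def __getitem__(self, url):
        raw = self.client.get(url)
        if raw is None:
            raise KeyError(url)
        # Return the stored JSON (str or bytes); the caller decodes it with json.loads.
        return raw


if __name__ == '__main__':
    rc = RedisCache(client=DictBackend())
    rc['https://example.com/job/1'] = {'job_name': 'Python'}
    print(json.loads(rc['https://example.com/job/1']))
```

Calling the constructor with no arguments, as the spider does, would connect to a local Redis server under these assumptions. Passing a stand-in client keeps the sketch testable offline.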
