Data Collection: Fetching the Latest Requirements Posted on oschina and Notifying the User in Real Time (Case Study 3)

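Background

A friend of mine plans to expand his sales channels by taking orders on crowdsourcing platforms. His main product is WeChat mini programs, so he wants to receive each new client requirement the moment it is posted and contact the client right away. That is the only way to keep his closing rate up; otherwise his competitors will have taken every order long before he gets there.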

Development environment

Language: Python. Framework: Scrapy. For data collection, Python is hard to beat.
IDE: PyCharm.

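Feature design

Real-time notification: notifications are sent by email. Binding the mailbox to WeChat makes each new message show up immediately.
Filter module: keywords are matched against both the title and the content. Orders that do not match are discarded; orders that match trigger a notification.
Configuration module: settings are stored in a JSON file.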
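
The article does not show the filter or the configuration file, so the following is only a minimal sketch of how the two could fit together. The key names `include_keywords` and `exclude_keywords`, the function names, and the rule "a keyword may appear in either the title or the content" are all assumptions made for illustration.

```python
import json

# Hypothetical configuration text; the real file layout is not shown in the article.
EXAMPLE_CONFIG = """
{
    "include_keywords": ["小程序", "微信"],
    "exclude_keywords": ["app"]
}
"""


def load_config(text):
    """Parse the JSON configuration text into a dict."""
    return json.loads(text)


def matches(title, content, config):
    """Return True when an order should trigger a notification.

    Assumed rule: the order is rejected if any exclude keyword appears in
    the title or the content, and accepted if any include keyword appears.
    """
    text = (title + " " + content).lower()
    include = [k.lower() for k in config.get("include_keywords", [])]
    exclude = [k.lower() for k in config.get("exclude_keywords", [])]
    if any(k in text for k in exclude):
        return False
    return any(k in text for k in include)
```

Matching is case-insensitive here, so an order that mentions "APP" is rejected by the "app" keyword. In the spider, a check like `matches(title, skills, config)` could sit in front of the mail call.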

Key code

Crawler module

# -*- coding: utf-8 -*-
import time

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from scrapy import Selector
from . import common


class OschinataskSpider(scrapy.Spider):
    name = 'oschinaTask'
    domain = "https://zb.oschina.net"
    # allowed_domains expects bare host names, not full URLs
    allowed_domains = ["zb.oschina.net"]
    start_url = 'https://zb.oschina.net/projects/list.html'
    start_urls = [start_url]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build the options object once; re-creating it after add_argument()
        # would silently discard the arguments.
        options = Options()
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--disable-gpu')
        self.driver = webdriver.Chrome(options=options)
        # self.driver.set_page_load_timeout(60)

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; shut the browser down
        self.driver.quit()

    def start_requests(self):
        # dont_filter=True keeps the duplicate filter from dropping this request.
        # Without it, repeated requests for the same URL are logged as
        # "Filtered duplicate request" and skipped.
        yield scrapy.Request(self.start_url, callback=self.parse, dont_filter=True)


    def parse(self, response):
        self.logger.info("parsing %s", response.url)
        # One node per project card on the list page
        nodes = response.xpath('//div[@class="el-col el-col-19"]')
        sent_id = common.read_taskid()  # highest task ID already notified
        max_id = sent_id
        for node in nodes:
            # Relative XPath (leading ".") keeps each query inside the current card
            title = node.xpath('.//span[@class="title"]/a/text()').get()
            url = self.domain + node.xpath('.//span[@class="title"]/a/@href').get()
            # The task ID is whatever follows "id=" in the detail URL
            id_str = url[url.find("id=") + 3:]
            task_id = int(id_str)
            price = node.xpath('.//span[@class="money"]/text()').get()
            skills = node.xpath('.//span[@class="skills"]/text()').get()
            tags = node.xpath('.//div[@class="tags mb-4"]/span/text()').getall()
            tag = " ".join(tags)
            subject = "oschina " + id_str + " " + title
            content = '%s <p> %s <p> <a href="%s">%s</a>  <p> %s' % (price, skills, url, url, tag)
            if task_id > sent_id:
                max_id = max(max_id, task_id)
                common.send_mail(subject, content)
            else:
                print("mail: task %r has already been sent" % task_id)
            time.sleep(3)  # pause before handling the next card
        # Record the highest ID seen so far
        common.write_taskid(id=max_id)

        # Wait 10 minutes; note that time.sleep blocks Scrapy's event loop
        time.sleep(10 * 60)
        # Request the same URL again to keep polling. dont_filter=True is the
        # catch here: without it the duplicate filter drops the request.
        yield scrapy.Request(self.start_url, callback=self.parse, dont_filter=True)
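
The spider relies on `common.read_taskid()` and `common.write_taskid()`, which the article does not show. As a rough sketch of what such helpers might look like, the version below stores the last notified ID in a text file and parses the ID from the query string instead of slicing the URL. The function names, the `path` parameter, and the example URL are hypothetical and are not the author's code.

```python
from urllib.parse import urlparse, parse_qs


def parse_task_id(url):
    """Return the integer value of the `id` query parameter of a URL."""
    query = parse_qs(urlparse(url).query)
    return int(query["id"][0])


def load_last_id(path):
    """Return the last notified task ID, or 0 if none has been stored yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        # Missing or empty file: treat every task as new
        return 0


def save_last_id(path, task_id):
    """Overwrite the stored task ID with the given value."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(task_id))
```

Parsing the query string keeps working if the site ever appends another parameter after `id`, a case where taking everything after "id=" would fail in `int()`.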

Notification module

 
import smtplib
from email.mime.text import MIMEText


def send_mail(subject, content):
    sender = 'xxxxx@qq.com'  # sender address
    passwd = 'xxxxxx'  # SMTP authorization code for the sender account
    receivers = 'xxxxx@qq.com'  # recipient address

    try:
        # Send as HTML so the link in the body is clickable;
        # use 'plain' instead of 'html' for a plain-text message
        msg = MIMEText(content, 'html', 'utf-8')
        msg['Subject'] = subject
        msg['From'] = sender
        msg['To'] = receivers

        # The with-block closes the connection even if sending fails
        with smtplib.SMTP_SSL('smtp.qq.com', 465) as s:
            s.set_debuglevel(1)
            s.login(sender, passwd)
            s.sendmail(sender, receivers, msg.as_string())
        return True
    except Exception as e:
        print(e)
        return False
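
Because the mail body is HTML assembled from scraped text, a title containing `<` or `&` could break the markup. The sketch below shows one way to escape the fields before building the message. `build_message` and its parameters are hypothetical names introduced for this example; they are not part of the original module.

```python
from email.mime.text import MIMEText
from html import escape


def build_message(subject, sender, receiver, title, url, price):
    """Build an HTML mail whose scraped fields are escaped before insertion."""
    safe_url = escape(url, quote=True)
    body = '<p>%s</p><p>%s</p><p><a href="%s">%s</a></p>' % (
        escape(title), escape(price), safe_url, safe_url)
    msg = MIMEText(body, "html", "utf-8")
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = receiver
    return msg
```

The returned object can be passed to `sendmail` through `msg.as_string()`, the same way the function above does.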

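Summary

Since going live the program has run stably and does what it was built to do, and my friend's order-taking rate has improved noticeably.

Appendix: Scrapy architecture diagram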

-------------------------------------------------------------------------------------------------------------------

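That is all for this post; discussion is welcome. QQ and WeChat (same ID): 6550523

This article is for technical exchange only. Commercial use and reproduction are prohibited, and violations will be pursued.

Original article: https://blog.csdn.net/lildkdkdkjf/article/details/107151659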
