
Multitasking Python Crawlers

March 31, 2020 | 移动技术网 IT编程


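I. Introduction to Multitasking

1. Why use a multitasking crawler?

When a large number of URLs have to be requested, crawling with a single thread in a single process is slow: while the program waits for each response the CPU sits idle, so CPU time is wasted.
Separating crawling from file writing keeps the crawler from stalling on file I/O, which speeds up crawling and makes fuller use of the CPU.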
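
The effect can be illustrated without touching the network. The following is a minimal sketch, not the author's code: `fake_fetch` is a hypothetical stand-in for an HTTP request (it only sleeps), and the example.com URLs and the 0.1 s delay are assumptions chosen for illustration.

```python
import time
from threading import Thread


def fake_fetch(url, results):
    """Hypothetical stand-in for an HTTP request: it only waits, like blocking I/O."""
    time.sleep(0.1)
    results.append(url)


def crawl_sequential(urls):
    """Fetch the URLs one after another in the calling thread."""
    results = []
    start = time.perf_counter()
    for url in urls:
        fake_fetch(url, results)
    return results, time.perf_counter() - start


def crawl_threaded(urls):
    """Fetch the URLs with one thread per URL, so the waits overlap."""
    results = []
    start = time.perf_counter()
    threads = [Thread(target=fake_fetch, args=(url, results)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start


if __name__ == '__main__':
    urls = [f'https://example.com/page/{i}' for i in range(5)]
    _, seq_time = crawl_sequential(urls)
    _, thr_time = crawl_threaded(urls)
    print(f'sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s')
```

Because the simulated waits overlap in the threaded version, its total time should be close to a single delay, while the sequential version pays the delay once per URL.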

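2. Types of multitasking

Process: a process is the smallest unit of resource allocation in the operating system. A running program has at least one process, and processes do not share data with each other. (Can use multiple cores.)
Thread: a thread is the smallest unit of CPU scheduling. Every process contains at least one thread, and the threads in a process share its data, so when several threads operate on the same object you have to think about data safety. (The most common choice for crawlers.)
Coroutine: a coroutine lives inside a thread. When the code running in a thread hits an I/O operation, execution switches to other code in the same thread, which hides as much I/O waiting as possible.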
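
The article gives no coroutine code. As a rough illustration of the switching described above, here is a minimal sketch that uses the standard-library `asyncio` module (one possible coroutine framework, chosen here as an assumption); `fake_fetch` and its delay are hypothetical.

```python
import asyncio


async def fake_fetch(page, delay):
    """Hypothetical stand-in for a network request; awaiting hands control to other coroutines."""
    await asyncio.sleep(delay)
    return f'page {page} done'


async def crawl(pages):
    """Run all the coroutines concurrently inside a single thread."""
    tasks = [fake_fetch(page, 0.1) for page in pages]
    # gather() returns the results in the order the coroutines were passed in
    return await asyncio.gather(*tasks)


if __name__ == '__main__':
    print(asyncio.run(crawl([1, 2, 3])))
```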

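3. How to make a program run faster

(1) Raise CPU utilization

If the program has only one thread, the CPU works on that thread alone. When the thread hits an I/O operation, the CPU stops working on the program until the I/O finishes, and that idle time is wasted CPU capacity.

If the program is multithreaded, the CPU switches among the threads: when one thread blocks, the CPU does not sit idle but runs another thread.

(2) Add more CPUs

A single CPU can process only one task at a time. With more CPUs, several tasks are processed at the same time, which also speeds the program up. This is what multiprocessing takes advantage of.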

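II. The threading module in Python (starting multiple threads)

Under the CPython interpreter, Python has no truly parallel threads: the threads of one process cannot execute Python code on several cores at the same moment. They take turns on the interpreter, switching quickly enough that they appear to run simultaneously. The cause is the GIL (Global Interpreter Lock). The threads of a process share data, and without a global interpreter lock several threads could operate on the same variable at the same time, corrupting its reference count and breaking garbage collection, so the interpreter takes a global lock. The GIL is released while a thread waits on blocking I/O, which is why threads still speed up I/O-bound work such as crawling.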

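2.1 Starting threads

from threading import Thread


# 1. Using a function
def task(arg):  # placeholder for the function the thread should run
    print(arg)


t = Thread(
    target=task,     # name of the function the thread will run
    args=('demo',),  # arguments for that function, as a tuple
)  # create the thread
t.start()  # start the thread


# 2. Using a class
class MyThread(Thread):
    def __init__(self, arg):
        super().__init__()
        self.arg = arg

    def run(self):
        # put the code that should run concurrently here
        print(self.arg)


if __name__ == '__main__':
    my = MyThread('demo')
    my.start()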
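
Neither form hands a return value back to the caller directly. One common pattern, sketched here with hypothetical names (`SquareThread`, `result`, `square_all`), is to store the result on the thread object and read it after `join()`.

```python
from threading import Thread


class SquareThread(Thread):
    """Hypothetical example thread that stores its result on the instance."""

    def __init__(self, number):
        super().__init__()
        self.number = number
        self.result = None

    def run(self):
        # run() executes in the new thread once start() is called
        self.result = self.number ** 2


def square_all(numbers):
    """Start one thread per number and collect the results after joining."""
    threads = [SquareThread(n) for n in numbers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until run() has finished before reading result
    return [t.result for t in threads]


if __name__ == '__main__':
    print(square_all([1, 2, 3]))
```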

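2.2 Commonly used thread methods and functions

import threading
from threading import Thread, current_thread, active_count
import time
import random


class MyThread(Thread):
    def __init__(self, index):
        super().__init__()
        self.index = index  # loop index, passed in instead of read from a global

    def run(self):
        time.sleep(random.random())
        msg = "I'm " + self.name + " @ " + str(self.index)  # self.name: name of this thread
        print(msg)
        print(current_thread().ident)  # id of the current thread
        print(current_thread().is_alive())  # whether the current thread is alive


if __name__ == '__main__':
    t_list = []
    for i in range(5):
        t = MyThread(i)
        t.start()
        t_list.append(t)
    while active_count() > 1:  # active_count(): number of live threads, including the main thread
        print(threading.enumerate())  # threading.enumerate(): list of live threads, including the main thread
        time.sleep(0.1)  # pause between polls so the loop does not flood the output
    for t in t_list:
        t.join()  # join() blocks the main thread until thread t has finished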

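2.3 Daemon threads

When a child thread is marked as a daemon thread, it does not keep the program alive. It keeps running until the main thread and all non-daemon child threads have finished, and is then terminated.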

from threading import Thread, current_thread
import time


def bar():
    while True:
        time.sleep(1)
        print(current_thread().name)


def foo():
    print(f'{current_thread().name} started...')
    time.sleep(2)
    print(f'{current_thread().name} finished...')


if __name__ == '__main__':
    t1 = Thread(target=bar)
    t1.daemon = True  # mark t1 as a daemon thread (must be set before start())
    t1.start()
    t2 = Thread(target=foo)
    t2.start()

# Output (default thread names changed in Python 3.10, which prints e.g. "Thread-1 (bar)")
Thread-2 started...
Thread-1
Thread-1
Thread-2 finished...

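2.4 Locks

In a multithreaded crawler, several threads sometimes read and write the same file, which makes the data unsafe. Below is an example that crawls Tencent job postings. Because multiple threads write to the same Excel file, the write has to be protected with a lock.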

import os
from threading import Thread, Lock

import requests
from jsonpath import jsonpath
from excle_wirte import excelutils  # the author's local helper module (not shown in the article)


def get_content(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'Referer': 'https://careers.tencent.com/search.html'
    }
    print(url)
    res = requests.get(url, headers=headers).json()
    jp = jsonpath(res, '$.*.posts.*')
    return jp


def write_excel(filename, item_list, sheetname):
    # Create the file on the first write, append to it afterwards
    if not os.path.exists(filename):
        excelutils.write_to_excel(filename, item_list, sheetname)
    else:
        excelutils.append_to_excel(filename, item_list)


def main(i, lock):
    base_url = 'https://careers.tencent.com/tencentcareer/api/post/query?timestamp=1585401795646&countryid=&cityid=&bgids=&productid=&categoryid=&parentcategoryid=&attrid=&keyword=&pageindex={}&pagesize=20&language=zh-cn&area=cn'
    content = get_content(base_url.format(i))
    with lock:  # acquire the lock; only one thread writes at a time
        write_excel('tencent.xls', content, 'hr')


if __name__ == '__main__':
    lock = Lock()  # create the lock (threading.Lock, since all writers are threads)
    for i in range(1, 11):
        t = Thread(target=main, args=(i, lock))
        t.start()
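
The example above needs network access and the author's helper module. The same locking idea can be tried in isolation with the sketch below, where a hypothetical `Counter` class stands in for the shared Excel file; all names in it are illustrative and not from the article.

```python
from threading import Thread, Lock


class Counter:
    """Hypothetical shared resource standing in for the Excel file."""

    def __init__(self):
        self.value = 0
        self._lock = Lock()

    def add_rows(self, n):
        for _ in range(n):
            # Only one thread at a time may run the read-modify-write below
            with self._lock:
                self.value += 1


def run_writers(thread_count, rows_per_thread):
    """Let several threads update the same counter and return the final value."""
    counter = Counter()
    threads = [
        Thread(target=counter.add_rows, args=(rows_per_thread,))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == '__main__':
    print(run_writers(4, 10000))
```

With the lock held around each update, the final value always equals the number of threads times the number of rows per thread.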

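2.5 The producer-consumer model

The producer-consumer problem is a classic problem in threading: producers and consumers share one storage space over the same period of time. Producers add items to the storage and consumers take items out. When the storage is empty the consumers block, and when it is full the producers block.

Example: a producer-consumer version of the Tencent job-posting crawler. It is written with functions here, but it could also be written with classes, which would be more convenient.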

import os
from queue import Queue, Empty
from threading import Thread, Lock

import requests
from jsonpath import jsonpath
from excle_wirte import excelutils  # the author's local helper module (not shown in the article)

flag = False  # set to True once all producers have finished


def get_url_list(num, url_queue):
    base_url = 'https://careers.tencent.com/tencentcareer/api/post/query?timestamp=1585401795646&countryid=&cityid=&bgids=&productid=&categoryid=&parentcategoryid=&attrid=&keyword=&pageindex={}&pagesize=20&language=zh-cn&area=cn'
    for i in range(1, num + 1):
        url_queue.put(base_url.format(i))


def producer(url_queue, content_queue):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'Referer': 'https://careers.tencent.com/search.html'
    }
    while True:
        try:
            url = url_queue.get_nowait()
        except Empty:
            break  # no URLs left, so this producer is done
        res = requests.get(url, headers=headers).json()
        jp = jsonpath(res, '$.*.posts.*')
        content_queue.put(jp)


def consumer(content_queue, lock, filename, sheetname):
    while True:
        if content_queue.empty() and flag:
            break  # producers are finished and nothing is left to write
        try:
            item_list = content_queue.get(timeout=1)  # wait briefly instead of spinning
        except Empty:
            continue
        with lock:
            if not os.path.exists(filename):
                excelutils.write_to_excel(filename, item_list, sheetname)
            else:
                excelutils.append_to_excel(filename, item_list)


if __name__ == '__main__':
    p_t_list = []
    url_queue = Queue()  # queue of URLs
    content_queue = Queue()  # queue of page contents
    get_url_list(10, url_queue)  # fill the URL queue
    lock = Lock()  # create the lock object
    for i in range(4):  # four threads fetch page contents
        p_t = Thread(target=producer, args=(url_queue, content_queue))
        p_t.start()
        p_t_list.append(p_t)
    for i in range(4):  # four threads parse contents and write the file
        t = Thread(target=consumer, args=(content_queue, lock, 'tencent.xls', 'hr'))
        t.start()
    for p_t in p_t_list:
        p_t.join()
    flag = True  # signals the consumers that the producers have finished
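
The blocking behaviour described at the start of this section can be seen in a smaller, offline sketch. It is an alternative to the flag used above, not the author's method: a bounded `queue.Queue` makes `put()` block when the queue is full and `get()` block when it is empty, and a hypothetical `SENTINEL` value tells the consumer to stop.

```python
from queue import Queue
from threading import Thread

SENTINEL = None  # hypothetical marker meaning "no more items"


def producer(items, q):
    for item in items:
        q.put(item)  # blocks while the queue is full
    q.put(SENTINEL)  # tell the consumer that production is finished


def consumer(q, out):
    while True:
        item = q.get()  # blocks while the queue is empty
        if item is SENTINEL:
            break
        out.append(item * 2)  # stand-in for parsing and writing


def run_pipeline(items, maxsize=2):
    """Run one producer and one consumer over a bounded queue."""
    q = Queue(maxsize=maxsize)
    out = []
    p = Thread(target=producer, args=(items, q))
    c = Thread(target=consumer, args=(q, out))
    p.start()
    c.start()
    p.join()
    c.join()
    return out


if __name__ == '__main__':
    print(run_pipeline([1, 2, 3, 4, 5]))
```

With several consumers, one sentinel per consumer would be needed; the single-consumer case is kept here for brevity.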

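2.6 Multiprocessing

Multiprocessing is generally used for compute-intensive tasks and is used less often in crawlers, because the number of processes worth starting depends on the number of CPU cores, and the operating system has to allocate resources for every process it starts, which is costly. This section only briefly covers the library Python provides and how to use it. Note that processes cannot exchange data directly; they rely on IPC (inter-process communication), which offers several mechanisms for exchanging data. The common ones are queues, pipes, and sockets. Third-party tools such as RabbitMQ and Redis can also be used.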

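from multiprocessing import Process


# 1. Using a function
def task(arg):  # placeholder for the function the process should run
    print(arg)


# 2. Using a class
class MyProcess(Process):
    def __init__(self, arg):
        super().__init__()
        self.arg = arg

    def run(self):
        # put the code that should run concurrently here
        print(self.arg)


if __name__ == '__main__':
    p = Process(
        target=task,     # name of the function the process will run
        args=('demo',),  # arguments for that function, as a tuple
    )  # create the process
    p.start()  # start the process

    my = MyProcess('demo')
    my.start()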

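The multiprocessing library provides many objects related to multiprocessing:

from multiprocessing import Queue, Pipe, Pool  # among others
Queue: queue
Pipe: pipe
Pool: pool (a separate module, concurrent.futures, unifies the process-pool and thread-pool interfaces and is more convenient to use)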

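III. Pools

3.1 What is a pool

Pools come in two kinds, thread pools and process pools. A pool holds a fixed number of threads or processes. Each task takes a thread/process from the pool to run on. When it finishes, the next task takes a thread/process from the pool, and so on until all tasks are done.

3.2 Why use a pool

A pool gives good control over the number of threads/processes started, improving throughput while keeping resource usage in check.
A callback function can be registered, which makes handling return values convenient.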

3.3 Basic pool usage, shown with a process pool (a thread pool works the same way)

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def fun(i):
    return i ** 2


def pr(con):
    p = con.result()
    print(p)


if __name__ == '__main__':
    p_pool = ProcessPoolExecutor(max_workers=4)  # create a pool with four processes
    for i in range(10):  # 10 tasks
        p = p_pool.submit(fun, i)  # submit a task
        p.add_done_callback(pr)  # register a callback
    p_pool.shutdown()  # shut down the pool
# Output (callbacks fire as tasks complete, so the order can vary between runs)
0
1
4
9
16
25
36
49
64
81
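
For comparison, here is a hedged sketch of the same idea with a thread pool. It uses a `with` block, which shuts the pool down automatically, and `as_completed` in place of callbacks; `square_with_threads` is a hypothetical helper name introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fun(i):
    return i ** 2


def square_with_threads(numbers, workers=4):
    """Submit one task per number to a thread pool and map inputs to results."""
    results = {}
    # Leaving the with-block calls shutdown() on the pool
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fun, n): n for n in numbers}
        for future in as_completed(futures):  # yields futures as they finish
            results[futures[future]] = future.result()
    return results


if __name__ == '__main__':
    print(square_with_threads(range(10)))
```

Keeping a dictionary from each future to its input makes it possible to match results to inputs even though tasks finish in an arbitrary order.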

3.4 Using the pool's map method (suited to simple arguments)

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def fun(i):
    return i ** 2


if __name__ == '__main__':
    p_pool = ProcessPoolExecutor(max_workers=4)
    p = p_pool.map(fun, range(10))
    print(list(p))  # map() returns an iterator; convert it to a list or loop over it to get the values

# Output
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
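
When the function takes more than one parameter, `map` accepts one iterable per positional parameter. The sketch below shows this with a thread pool; `power` and `powers` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def power(base, exponent):
    return base ** exponent


def powers(bases, exponents, workers=4):
    """Compute base ** exponent pairwise in a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() takes one iterable per positional parameter and
        # yields the results in input order
        return list(pool.map(power, bases, exponents))


if __name__ == '__main__':
    print(powers([2, 3, 4], [3, 2, 1]))
```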
