
Python Crawler Beginner Tutorial 36-100: Crawling Every App on Coolapk (酷安) with Scrapy

February 20, 2019 | 移动技术网, IT Programming


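Before We Crawl

2018 is almost over. In four days I will be writing tutorials for 2019. Nothing sentimental about it; a year simply went by. Today's target is a site called 酷安 (Coolapk), an app store. You could try crawling it through the mobile app instead, but I plan to cover app crawling after the 50th post in this series, so we will set that aside for now.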


Opening the Coolapk home page lands you on an ad page; click 应用 (Apps) in the header to reach the app list.


Page Analysis

First find the pagination URL; with it we can build the address of every list page.
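As a minimal sketch of what "building every page" could look like, the snippet below generates list-page URLs from the `?p=N` pattern used by the start URL in this article. The helper name `build_page_urls` and the page count passed to it are hypothetical; the spider later in the article follows the next-page link instead of precomputing the URLs.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.coolapk.com/apk"


def build_page_urls(last_page):
    """Return the list-page URLs for pages 1..last_page (inclusive)."""
    return [f"{BASE_URL}?{urlencode({'p': page})}" for page in range(1, last_page + 1)]


# The real number of pages has to be read from the site; 3 is only a demo value.
urls = build_page_urls(3)
for url in urls:
    print(url)
```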

Next, locate the data we want to save for later analysis.

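The information above is everything we need, so all that remains is to crawl it. This post still uses Scrapy, and all of the code appears in the article; once you have read it through, you will have the complete code.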

import re  # regular expressions, used by the helper methods further down

import scrapy

from apps.items import AppsItem  # the project's Item class


class AppsSpider(scrapy.Spider):
    name = 'apps'
    allowed_domains = ['www.coolapk.com']
    start_urls = ['https://www.coolapk.com/apk?p=1']
    # Per-spider overrides of the project-wide settings
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'User-Agent': 'Mozilla/5.0 your UA',  # replace with your own user agent
        }
    }

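Code walkthrough

This is the first time custom_settings appears in the series. Its purpose is to override, for this spider only, the defaults configured in settings.py.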

    def parse(self, response):
        # Each <a> directly under .app_left_list links to one app's detail page
        list_items = response.css(".app_left_list>a")
        for item in list_items:
            url = item.css("::attr('href')").extract_first()
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_url)

        # Follow the pagination link; stop when the page has none
        next_page = response.css('.pagination li:nth-child(8) a::attr(href)').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url, callback=self.parse)

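Code walkthrough

response.css parses the page with CSS selectors. For the syntax, refer to the code above, paying particular attention to ::attr('href') and ::text.

response.urljoin resolves a possibly relative URL against the URL of the current page.

next_page holds the link used to move to the next page.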
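To make the URL joining concrete, here is a small sketch using the standard library's `urljoin`, which resolves links the same way when given the page URL as the base. The app path `/apk/com.example.app` is a made-up example, not a link taken from the site.

```python
from urllib.parse import urljoin

# The URL of the page being parsed (the spider's start URL)
page_url = "https://www.coolapk.com/apk?p=1"

# A root-relative link to a detail page (hypothetical package name)
detail_url = urljoin(page_url, "/apk/com.example.app")

# A root-relative pagination link
next_url = urljoin(page_url, "/apk?p=2")

# An absolute URL is returned unchanged
same_url = urljoin(page_url, "https://www.coolapk.com/apk?p=3")

print(detail_url)
print(next_url)
print(same_url)
```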

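The parse_url method parses each detail page. It relies on three helper methods: self.getinfo(response), self.gettags(response), and self.getappinfo(response). Note also that response.css().re supports regular-expression matching, so you can pull substrings out of the selected text.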

    def parse_url(self, response):
        item = AppsItem()

        item["title"] = response.css(".detail_app_title::text").extract_first()

        # getinfo returns the four captured values in page order
        info = self.getinfo(response)
        item['volume'] = info[0]
        item['downloads'] = info[1]
        item['follow'] = info[2]
        item['comment'] = info[3]

        item["tags"] = self.gettags(response)
        item['rank_num'] = response.css('.rank_num::text').extract_first()
        item['rank_num_users'] = response.css('.apk_rank_p1::text').re("共(.*?)个评分")[0]
        item["update_time"], item["rom"], item["developer"] = self.getappinfo(response)

        yield item

The three helper methods are as follows.

    def getinfo(self, response):
        # .re() returns the captured groups as one flat list of strings
        info = response.css(".apk_topba_message::text").re(
            r"\s+(.*?)\s+/\s+(.*?)下载\s+/\s+(.*?)人关注\s+/\s+(.*?)个评论.*?")
        return info

    def gettags(self, response):
        tags = response.css(".apk_left_span2")
        tags = [item.css('::text').extract_first() for item in tags]
        return tags

    def getappinfo(self, response):
        # app_info = response.css(".apk_left_title_info::text").re(r"[\s\S]+更新时间：(.*?)")
        body_text = response.text  # decoded page source (replaces the deprecated body_as_unicode())

        # Capture lazily up to the next "<" so trailing tags on the same line are not included
        update = re.findall(r"更新时间：(.*?)<", body_text)[0]
        # The label's letter case is matched case-insensitively
        rom = re.findall(r"支持ROM：(.*?)<", body_text, flags=re.I)[0]
        developer = re.findall(r"开发者名称：(.*?)<", body_text)[0]
        return update, rom, developer
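The regular expression in getinfo is easier to follow when run against a sample string. The sketch below applies the same pattern with the standard `re` module; the sample text and its numbers are invented to mimic the "size / downloads / followers / comments" shape the pattern expects, and are not real data from the site.

```python
import re

# Hypothetical text in the shape the .apk_topba_message selector is expected to return
sample = "\n    12.34M / 100万下载 / 5000人关注 / 300个评论\n  "

pattern = r"\s+(.*?)\s+/\s+(.*?)下载\s+/\s+(.*?)人关注\s+/\s+(.*?)个评论.*?"

match = re.search(pattern, sample)

# The four groups line up with info[0]..info[3] in parse_url
volume, downloads, follow, comment = match.groups()
print(volume, downloads, follow, comment)
```

Each lazy group `(.*?)` stops at the first point where the literal text that follows it (`下载`, `人关注`, `个评论`) can match, which is why the units are stripped from the captured values.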

Saving the Data

I am not going to hand you the Item definition here; you can infer it from the code above, haha.
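For readers who want to check their own answer, here is one possible reconstruction of items.py. It is inferred purely from the keys that parse_url assigns and from the `apps.items` import in the spider, so treat it as an assumption rather than the author's actual file.

```python
import scrapy


class AppsItem(scrapy.Item):
    # One Field per key assigned in parse_url
    title = scrapy.Field()
    volume = scrapy.Field()
    downloads = scrapy.Field()
    follow = scrapy.Field()
    comment = scrapy.Field()
    tags = scrapy.Field()
    rank_num = scrapy.Field()
    rank_num_users = scrapy.Field()
    update_time = scrapy.Field()
    rom = scrapy.Field()
    developer = scrapy.Field()
```

Assigning a key that is not declared as a Field raises a KeyError, so the field names here need to match the keys used in the spider exactly.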

import pymongo


class AppsPipeline(object):

    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB connection settings from settings.py
        return cls(
            mongo_url=crawler.settings.get("MONGO_URL"),
            mongo_db=crawler.settings.get("MONGO_DB")
        )

    def open_spider(self, spider):
        try:
            self.client = pymongo.MongoClient(self.mongo_url)
            self.db = self.client[self.mongo_db]
        except Exception as e:
            print(e)

    def process_item(self, item, spider):
        # Use the item's class name as the collection name
        name = item.__class__.__name__

        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

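Code walkthrough

open_spider runs when the spider starts and opens the MongoDB connection.

process_item stores each item.

close_spider runs when the spider finishes and closes the connection.

Pay particular attention to from_crawler: it is a class method that Scrapy calls when it creates the pipeline, and it reads the configuration from settings.py.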

SPIDER_MODULES = ['apps.spiders']
NEWSPIDER_MODULE = 'apps.spiders'
MONGO_URL = '127.0.0.1'
MONGO_DB = 'kuan'
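The settings shown do not include the line that switches the pipeline on. Scrapy only runs pipelines listed in ITEM_PIPELINES, so settings.py presumably also needs an entry along these lines. The module path `apps.pipelines` is an assumption based on the project's `apps.items` and `apps.spiders` modules, and 300 is just a conventional priority value.

```python
# settings.py (fragment): enable the MongoDB pipeline.
# Keys are import paths; values set the run order (lower runs first).
ITEM_PIPELINES = {
    "apps.pipelines.AppsPipeline": 300,  # assumed module path
}
```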


Getting the Data

Adjust the crawl speed and the concurrency:

DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 8
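To get a feel for what a three-second delay means for total run time, here is a rough back-of-the-envelope estimate. It assumes requests to the single domain end up spaced about DOWNLOAD_DELAY seconds apart and ignores download time and Scrapy's default randomization of the delay. The page and app counts are invented for illustration, and `estimate_crawl_seconds` is a hypothetical helper.

```python
def estimate_crawl_seconds(list_pages, apps_per_page, download_delay):
    """Rough lower bound: one request per list page plus one per app detail page."""
    total_requests = list_pages + list_pages * apps_per_page
    return total_requests * download_delay


# Made-up numbers: 100 list pages with 10 apps each, 3-second delay
seconds = estimate_crawl_seconds(list_pages=100, apps_per_page=10, download_delay=3)
print(f"{seconds} seconds, about {seconds / 3600:.1f} hours")
```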

Run the code, and after a fair bit of effort, the data arrives!

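I will write up a data analysis of Coolapk when I find the time. If you need the source code, type it out yourself from start to finish and you will be all set.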
