Scraping the Novel 《斗罗大陆》 (Douluo Dalu) with Python


I'm a beginner who hasn't been learning Python web scraping for long, so please bear with any rough edges.

This is the address of the novel we are going to scrape today: http://www.biquku.la/0/421/

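Before scraping the novel, let's get familiar with the page. Press F12 to open the developer tools and inspect the page source: all chapter entries sit under <div id="list">. Each entry holds a chapter title and a link to that chapter, which gives us the novel's chapter list.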
Next, inspect a chapter page the same way (press F12 on the page that shows the text). The whole chapter body turns out to be stored inside the <div id="content"> tag.

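That completes the preparation, and we can start writing the crawler. The libraries needed are requests, re, os, and lxml; requests and lxml are third-party packages and must be installed beforehand.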

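Next, we import these libraries and build a request header that imitates a browser. I won't go into request headers here, since the previous post (on scraping emoticon packs) covered them; the imports and the headers dictionary are included in the full source at the end.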

Now for the crawler itself. First, define a function that fetches the page source and parses it with etree and XPath:

def get_info(url):
    # Fetch the chapter index page and decode it as UTF-8.
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    get_info_list = []
    html = etree.HTML(response.text)
    # Each chapter is a <dd><a href="..."> entry under the element with id="list".
    dd_list = html.xpath('//*[@id="list"]/dl/dd')
    for dd in dd_list:
        title = dd.xpath('a/text()')[0]
        # The code treats chapter links as relative, so prefix them with the index URL.
        href = url + dd.xpath('a/@href')[0]
        chapter = {'title': title, 'href': href}
        get_info_list.append(chapter)
    return get_info_list
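
To see what the XPath in get_info selects without touching the network, the sketch below runs the same path over a small hand-written fragment. It is only an illustration under assumptions: the sample markup, the name parse_chapters, and the chapter file names are invented here. It uses the standard library's xml.etree.ElementTree (which supports this subset of XPath on well-formed markup) as a stand-in for lxml, and urljoin in place of plain string concatenation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hand-written stand-in for the chapter index page (not real site output).
SAMPLE_INDEX = """
<html><body>
<div id="list"><dl>
<dd><a href="1001.html">Chapter 1</a></dd>
<dd><a href="1002.html">Chapter 2</a></dd>
</dl></div>
</body></html>
"""


def parse_chapters(page_source, base_url):
    """Return [{'title': ..., 'href': ...}] for every <dd><a> under id="list"."""
    root = ET.fromstring(page_source.strip())
    chapters = []
    # Same path as in get_info, written relative to the root element.
    for dd in root.findall('.//*[@id="list"]/dl/dd'):
        link = dd.find('a')
        if link is None:
            continue  # skip <dd> entries that carry no link
        chapters.append({
            'title': link.text,
            # urljoin resolves the relative href against the index URL.
            'href': urljoin(base_url, link.get('href')),
        })
    return chapters


chapters = parse_chapters(SAMPLE_INDEX, 'http://www.biquku.la/0/421/')
for chapter in chapters:
    print(chapter['title'], chapter['href'])
```

Real pages are rarely well-formed XML, which is why the article's code relies on lxml's etree.HTML parser instead; the sketch is only meant to show which nodes the path picks out.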

Next, fetch the content of each chapter. A regular expression extracts the text, which is then written to a file:

def get_content(chapters):
    # Create the output folder once, before the download loop.
    os.makedirs('斗罗大陆', exist_ok=True)
    for chapter_info in chapters:
        response = requests.get(url=chapter_info['href'], headers=headers)
        response.encoding = 'utf-8'
        # re.S lets "." match newlines in case the chapter body spans several lines.
        contents = re.findall('<div id="content">(.*?)</div>', response.text, re.S)
        with open('./斗罗大陆/' + chapter_info['title'] + '.txt', 'w', encoding='utf-8') as f:
            for content in contents:
                # Drop the indentation entities and turn paragraph breaks into newlines.
                f.write(content.replace('&nbsp;&nbsp;&nbsp;&nbsp;', '').replace('<br/><br/>', '\n').strip())
        print('Downloaded:', chapter_info['title'])
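
The extraction and clean-up step can be tried offline. The following is a minimal sketch on a made-up page fragment, not real site output; the name extract_text and the sample text are assumptions for illustration. It passes re.S so that the pattern still matches when the chapter body contains line breaks.

```python
import re

# Hand-written stand-in for a chapter page (not real site output).
SAMPLE_PAGE = (
    '<div id="content">&nbsp;&nbsp;&nbsp;&nbsp;First paragraph.<br/><br/>\n'
    '&nbsp;&nbsp;&nbsp;&nbsp;Second paragraph.</div>'
)

PATTERN = r'<div id="content">(.*?)</div>'


def extract_text(page_source):
    """Pull the chapter body out of <div id="content"> and tidy it up."""
    # re.S makes "." match newlines, so a multi-line body is captured too.
    bodies = re.findall(PATTERN, page_source, re.S)
    cleaned = []
    for body in bodies:
        body = body.replace('&nbsp;&nbsp;&nbsp;&nbsp;', '')  # indentation entities
        body = body.replace('<br/><br/>', '\n')               # paragraph breaks
        cleaned.append(body.strip())
    return '\n'.join(cleaned)


text = extract_text(SAMPLE_PAGE)
print(text)

# Without re.S the same pattern finds nothing here, because the body
# contains a newline that "." refuses to cross.
print(re.findall(PATTERN, SAMPLE_PAGE))
```

If a chapter file comes out empty, the missing flag or a slightly different tag spelling (for example <br /> instead of <br/>) would be the first things to check.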

Finally, call the functions from the main entry point to run the crawler:

if __name__ == '__main__':
    get_content(get_info(url))

Let's look at the result: the chapters can now be read locally.
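
One thing to watch when saving: the chapter title is used directly as the file name, so a title containing a character that Windows forbids (such as ? or :) would make open() fail. A possible hardening is sketched below. The helper names safe_filename and save_chapter, the folder name, and the sample title are hypothetical and not part of the original script.

```python
import os
import re
import tempfile


def safe_filename(title):
    """Replace characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


def save_chapter(folder, title, text):
    """Write one chapter to <folder>/<title>.txt and return the file path."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, safe_filename(title) + '.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return path


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chapter(os.path.join(tmp, 'novel'), 'Chapter 1: Why?', 'sample text')
    saved_name = os.path.basename(saved)
    with open(saved, encoding='utf-8') as f:
        saved_text = f.read()

print(saved_name)
```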
Full source code:

import requests
import re, os
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
url = 'http://www.biquku.la/0/421/'


def get_info(url):
    # Fetch the chapter index page and decode it as UTF-8.
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    get_info_list = []
    html = etree.HTML(response.text)
    # Each chapter is a <dd><a href="..."> entry under the element with id="list".
    dd_list = html.xpath('//*[@id="list"]/dl/dd')
    for dd in dd_list:
        title = dd.xpath('a/text()')[0]
        # The code treats chapter links as relative, so prefix them with the index URL.
        href = url + dd.xpath('a/@href')[0]
        chapter = {'title': title, 'href': href}
        get_info_list.append(chapter)
    return get_info_list


def get_content(chapters):
    # Create the output folder once, before the download loop.
    os.makedirs('斗罗大陆', exist_ok=True)
    for chapter_info in chapters:
        response = requests.get(url=chapter_info['href'], headers=headers)
        response.encoding = 'utf-8'
        # re.S lets "." match newlines in case the chapter body spans several lines.
        contents = re.findall('<div id="content">(.*?)</div>', response.text, re.S)
        with open('./斗罗大陆/' + chapter_info['title'] + '.txt', 'w', encoding='utf-8') as f:
            for content in contents:
                # Drop the indentation entities and turn paragraph breaks into newlines.
                f.write(content.replace('&nbsp;&nbsp;&nbsp;&nbsp;', '').replace('<br/><br/>', '\n').strip())
        print('Downloaded:', chapter_info['title'])


if __name__ == '__main__':
    get_content(get_info(url))

If you spot any mistakes, feel free to send me a private message with corrections. Thanks for your support!

Original article: https://blog.csdn.net/qq_47183158/article/details/107441765

July 20, 2020 | 移动技术网 IT编程