
How to Use a Python Crawler to Go from Scraping One Chapter to Scraping an Entire Novel Site

March 29, 2020 | 移动技术网 IT编程


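Preface

The text and images in this article come from the internet and are for learning and exchange only, with no commercial use. Copyright belongs to the original authors; if there is any problem, please contact us promptly so we can deal with it.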


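Many good novels can be read online but not downloaded. This article shows how to scrape every novel on a site.

Topics covered:

requests
xpath
The approach for scraping a whole site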

Development environment:

Version: Anaconda 5.2.0 (Python 3.6.5)
Editor: PyCharm

Third-party libraries:

requests
parsel

Analyzing the web page

Target site:

Using the browser developer tools: the Network and Elements panels

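Scraping one chapter

Using the requests library (requesting the page data)
Wrapping the request steps in a function
Using CSS selectors (parsing the page data)
Working with files (persisting the data)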

# -*- coding: utf-8 -*-
"""Scrape a single chapter of a novel."""
import requests
import parsel

# Request the page, sending a browser-like User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
response = requests.get('http://www.shuquge.com/txt/8659/2324752.html', headers=headers)
# Use the encoding that requests guesses from the response body
response.encoding = response.apparent_encoding
html = response.text
print(html)

# Extract the content from the page
sel = parsel.Selector(html)
title = sel.css('.content h1::text').extract_first()
contents = sel.css('#content::text').extract()
contents2 = []
for content in contents:
    contents2.append(content.strip())
print(contents)
print(contents2)
print("\n".join(contents2))

# Write the content to a text file
with open(title + '.txt', mode='w', encoding='utf-8') as f:
    f.write("\n".join(contents2))
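The script above passes title straight to open(). That fails when the selector matches nothing (extract_first() then returns None) or when the chapter title contains characters that are not allowed in file names. The following is a minimal sketch of a guard, not part of the original script; the helper names safe_filename and clean_lines and the sample strings are assumptions made for illustration.

```python
import re


def safe_filename(title, default='untitled'):
    """Return a file name that is safe to pass to open().

    Hypothetical helper, not part of the original script.
    """
    if not title:
        # extract_first() returns None when the selector matches nothing
        return default + '.txt'
    # Replace path separators and other characters that file systems commonly reject
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip()
    return (cleaned or default) + '.txt'


def clean_lines(contents):
    """Strip each text fragment and drop the ones that end up empty."""
    stripped = (content.strip() for content in contents)
    return [line for line in stripped if line]


if __name__ == '__main__':
    # Made-up sample values, only to show the helpers in use
    print(safe_filename('Chapter 1: What?'))
    print(clean_lines(['\r\n', '  first paragraph ', '', 'second paragraph\n']))
```

With these helpers, the last step of the script could open safe_filename(title) and write "\n".join(clean_lines(contents)), which also avoids the blank lines that the stripped but empty fragments would otherwise produce.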

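Scraping a whole book

Refactoring the crawler: a book has many chapters to scrape, and the most naive way to handle them is a plain for loop.
Scraping the index page: to scrape every chapter, it is enough to collect the URL of each chapter from the book's index page.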
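The href values collected from an index page may be relative, so each one has to be combined with the index page URL before it can be requested. A minimal sketch using urllib.parse.urljoin from the standard library follows; the function name build_chapter_urls and the sample hrefs are illustrative assumptions, not taken from the site.

```python
from urllib.parse import urljoin


def build_chapter_urls(book_url, hrefs):
    """Turn the hrefs found on an index page into absolute chapter URLs."""
    urls = []
    seen = set()
    for href in hrefs:
        url = urljoin(book_url, href)
        # In case the index page lists a chapter more than once, keep the first occurrence
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == '__main__':
    book_url = 'http://www.shuquge.com/txt/8659/'
    sample_hrefs = ['2324752.html', '2324753.html', '2324752.html']  # made-up sample
    for url in build_chapter_urls(book_url, sample_hrefs):
        print(url)
```

Unlike plain string concatenation, urljoin also handles hrefs that start with a slash or that are already absolute.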

from urllib.parse import urljoin

import requests
import parsel

# Send requests that look like they come from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}


def download_one_chapter(target_url):
    """Download one chapter and save it as a text file."""
    # URL to request, for example:
    # target_url = 'http://www.shuquge.com/txt/8659/2324753.html'
    # response is the object that wraps what the server returned
    # (in PyCharm, Ctrl + left-click jumps to its definition)
    response = requests.get(target_url, headers=headers)
    # Decode with the encoding guessed from the content
    response.encoding = response.apparent_encoding
    # .text returns the page content as a string
    # print(response.text)
    html = response.text

    # --- Extract the information from the page source ---
    # parsel turns the string into a Selector object (the selectors Scrapy uses)
    sel = parsel.Selector(html)
    # CSS selectors pick tags; ::text and ::attr() pick a tag's text or attribute
    # extract_first() returns the first match
    title = sel.css('.content h1::text').extract_first()
    # extract() returns every match
    contents = sel.css('#content::text').extract()
    print(title)
    print(contents)

    # --- Clean the data: strip surrounding whitespace ---
    # The list comprehension below is equivalent to:
    # contents1 = []
    # for content in contents:
    #     contents1.append(content.strip())
    contents1 = [content.strip() for content in contents]
    print(contents1)
    # Turn the list into a single string
    text = '\n'.join(contents1)
    print(text)

    # --- Save the chapter ---
    # open() opens a file for writing or reading; the with block closes it automatically
    with open(title + '.txt', mode='w', encoding='utf-8') as file:
        # write() only accepts strings
        file.write(title + '\n')
        file.write(text)


def get_book_links(book_url):
    """Take the URL of a book's index page and return its chapter hrefs."""
    response = requests.get(book_url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    sel = parsel.Selector(html)
    links = sel.css('dd a::attr(href)').extract()
    return links


def get_one_book(book_url):
    """Download every chapter of one book."""
    links = get_book_links(book_url)
    for link in links:
        # Build the chapter URL from the index URL instead of a hard-coded prefix
        chapter_url = urljoin(book_url, link)
        print(chapter_url)
        download_one_chapter(chapter_url)


if __name__ == '__main__':
    # target_url = 'http://www.shuquge.com/txt/8659/2324754.html'
    # # keyword arguments vs. positional arguments
    # download_one_chapter(target_url=target_url)
    # To download a different book, just change the URL
    book_url = 'http://www.shuquge.com/txt/8659/'
    get_one_book(book_url)
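The loop in get_one_book sends its requests back to back and stops at the first one that raises an error. A minimal sketch of a more tolerant loop is shown here, under assumptions: download_all is a hypothetical name, download stands for a function such as download_one_chapter, and the delay and retries defaults are arbitrary.

```python
import time


def download_all(chapter_urls, download, delay=1.0, retries=2):
    """Call download(url) for every URL, pausing between requests.

    Returns the list of URLs that still failed after all retries.
    """
    failed = []
    for url in chapter_urls:
        for attempt in range(retries + 1):
            try:
                download(url)
                break
            except Exception as error:  # for example, a network error raised by requests
                print(f'attempt {attempt + 1} failed for {url}: {error}')
                time.sleep(delay)
        else:
            # The inner loop ended without break, so every attempt failed
            failed.append(url)
        # Pause before the next chapter to avoid hammering the server
        time.sleep(delay)
    return failed
```

Inside get_one_book, the for loop could then be replaced by a single call such as download_all(chapter_urls, download_one_chapter), with the returned list showing which chapters need another attempt.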

Scraping the whole site
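The post gives no code for this step. One possible approach, sketched under assumptions: if the site's category or ranking pages link to each book's index page, scraping the whole site reduces to collecting those index URLs and calling get_one_book on each of them. The function crawl_site and its parameters are hypothetical; list_book_urls stands for whatever function extracts book index URLs from a listing page, and the structure of those listing pages is not described in the source.

```python
def crawl_site(listing_urls, list_book_urls, get_one_book):
    """Download every book reachable from the given listing pages.

    listing_urls: URLs of category or ranking pages (assumed to exist on the site)
    list_book_urls: callable mapping a listing URL to a list of book index URLs (hypothetical)
    get_one_book: callable that downloads one book given its index URL
    """
    done = set()
    for listing_url in listing_urls:
        for book_url in list_book_urls(listing_url):
            if book_url in done:
                # The same book can appear on several listing pages
                continue
            done.add(book_url)
            get_one_book(book_url)
    return sorted(done)
```

Passing the two functions in as arguments keeps the loop independent of any particular page layout, so only list_book_urls has to change when the selectors for a listing page are worked out.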
