
I Finally Couldn't Resist: Scraping the "Goddess" Rankings with Python

November 26, 2019 | 移动技术网 IT编程


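Admit it: you learned web scraping so that, sooner or later, you could scrape pictures of pretty girls.

Enough talk, let's hand out the goodies.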

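女神大会 (the "Goddess Assembly")

I don't know how many people have heard of the app (and website) 懂球帝 (Dongqiudi), or how many follow its column 女神大会, the "Goddess Assembly". There is no football there, only goddesses.

This is what it looks like:

The goddess scores are decided entirely by football fans. Exciting, isn't it? Let's see how the goddesses rank in the eyes of the fans.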

Getting started

Getting the id information

First, by capturing the network requests made by the Dongqiudi app, we can find an API:

http://api.dongqiudi.com/search?keywords=%e5%a5%b3%e7%a5%9e%e5%a4%a7%e4%bc%9a&type=all&page=

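From this API we can get the following information:

We mainly care about id and thumb. The id is used later to build the URL of the HTML page for each goddess, and thumb (the thumbnail URL) is kept for our collection.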
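As a minimal sketch of the two ideas above, the snippet below builds the search URL for a given page and pulls id and thumb out of a response. The helper names build_search_url and extract_ids are hypothetical, and the sample payload is made up for illustration; the only thing assumed about the real response is the news list with id and thumb fields that this article relies on.

```python
import json
from urllib.parse import urlencode

API_ROOT = "http://api.dongqiudi.com/search"


def build_search_url(page, keywords="女神大会"):
    """Build the search URL; urlencode percent-encodes the Chinese keyword."""
    query = urlencode({"keywords": keywords, "type": "all", "page": page})
    return API_ROOT + "?" + query


def extract_ids(response_text):
    """Return (ids, {id: thumb}) from a JSON response containing a 'news' list."""
    news = json.loads(response_text)["news"]
    ids = [item["id"] for item in news]
    thumbs = {item["id"]: item["thumb"] for item in news}
    return ids, thumbs


# Made-up payload with the same shape as the fields used in this article.
sample = json.dumps({"news": [{"id": "1001", "thumb": "https://example.com/a.jpg"}]})
url = build_search_url(1)
ids, thumbs = extract_ids(sample)
print(url)
print(ids, thumbs)
```

Letting urlencode handle the keyword avoids pasting the percent-encoded form of 女神大会 into the source by hand.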

So we can write a simple parsing function:

    import json
    import time

    import requests


    def get_list(page):
        """Return (list of ids, list of {id: thumb} dicts) from the search API."""
        nvshen_id_list = []
        nvshen_id_picture = []
        # range(1, page) requests pages 1 .. page - 1
        for i in range(1, page):
            print("Fetching page " + str(i))
            url = 'http://api.dongqiudi.com/search?keywords=%e5%a5%b3%e7%a5%9e%e5%a4%a7%e4%bc%9a&type=all&page=' + str(i)
            html = requests.get(url=url).text
            news = json.loads(html)['news']
            if len(news) == 0:
                print("No more data")
                break
            nvshen_id = [k['id'] for k in news]
            nvshen_id_list = nvshen_id_list + nvshen_id
            nvshen_id_picture = nvshen_id_picture + [{k['id']: k['thumb']} for k in news]
            # be polite: pause between requests
            time.sleep(1)
        return nvshen_id_list, nvshen_id_picture

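Downloading the HTML pages

Next, by inspection, we find that the page for every goddess has a URL of this form:

https://www.dongqiudi.com/archive/**.html

Here ** is the id obtained above, which gives us the code for fetching the HTML pages: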

    import os
    import time

    import requests


    class downloadpage(object):
        def gethtml(self, url):
            # .content returns raw bytes, which savehtml writes in binary mode
            html = requests.get(url=url).content
            return html

        def savehtml(self, file_name, file_content):
            # create the output folder on first use
            os.makedirs('html_page', exist_ok=True)
            with open('html_page/' + str(file_name) + '.html', 'wb') as f:
                f.write(file_content)


    def download_page(nvshen_id_list):
        for i in nvshen_id_list:
            print("Downloading the html page with id " + str(i))
            url = 'https://www.dongqiudi.com/archive/%s.html' % i
            download = downloadpage()
            html = download.gethtml(url)
            download.savehtml(i, html)
            time.sleep(2)

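To avoid being rate-limited, we wait 2 seconds after every request.

But then a problem came up

When I requested the page directly, this is what came back:

Rejected. What a tragedy.

Nothing for it but to keep fighting. Analyzing the request again, I found that it carries a cookie. Ha, that is familiar territory by now.

Add the cookie to the requests call, add a user-agent to the headers as well, and try again.

It worked!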
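The article does not show the modified request, so the following is only a sketch of one way to prepare it. The cookie string and user-agent below are placeholders (copy the real values from your browser's developer tools), and the helper names are hypothetical. The snippet uses only the standard library to turn a raw Cookie header into the headers and cookies arguments that requests.get accepts.

```python
from http.cookies import SimpleCookie

# Placeholder values, not the real ones used by the site.
RAW_COOKIE = "sessionid=abc123; token=xyz"
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def parse_cookie_header(raw_cookie):
    """Turn a 'k1=v1; k2=v2' Cookie header string into a plain dict."""
    jar = SimpleCookie()
    jar.load(raw_cookie)
    return {name: morsel.value for name, morsel in jar.items()}


def build_request_kwargs(raw_cookie, user_agent):
    """Keyword arguments to pass along with the URL to requests.get."""
    return {
        "headers": {"User-Agent": user_agent},
        "cookies": parse_cookie_header(raw_cookie),
    }


kwargs = build_request_kwargs(RAW_COOKIE, USER_AGENT)
print(kwargs)
```

With these in hand, the call in gethtml would become requests.get(url, **kwargs).content.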

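Parsing the local HTML

The last step is to parse the HTML pages saved locally. The pattern is that the page introducing the current issue's goddess also announces the overall score of the previous issue's goddess, and our main task is to collect the score of each goddess.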

    import os
    import re

    from bs4 import BeautifulSoup


    def deal_loaclfile(nvshen_id_picture):
        files = os.listdir('html_page/')
        nvshen_list = []
        special_page = []
        for f in files:
            if not f.endswith('.html') or f.startswith('~'):
                continue
            with open('html_page/' + f, 'r', encoding='utf-8') as fh:
                htmlfile = fh.read()
            content = BeautifulSoup(htmlfile, 'html.parser')
            try:
                tmp_list = []
                # the sentence that announces the previous issue's goddess
                nvshen_name = content.find(string=re.compile("上一期女神"))
                if nvshen_name is None:
                    continue
                nvshen_name_new = re.findall(r"女神(.+?)，", nvshen_name)
                nvshen_count = re.findall(r"超过(.+?)人", nvshen_name)
                tmp_list.append(''.join(nvshen_name_new))    # name
                tmp_list.append(''.join(nvshen_count))       # number of voters
                tmp_list.append(os.path.splitext(f)[0])      # id of the page being parsed
                # scores are rendered in red <span> elements
                tmp_score = content.find_all('span', attrs={'style': "color:#ff0000"})
                tmp_score = list(filter(None, [k.string for k in tmp_score]))
                # use the first red span if it contains a decimal point, else the second
                score = None
                if tmp_score and '.' in tmp_score[0]:
                    score = tmp_score[0]
                elif len(tmp_score) > 1 and '.' in tmp_score[1]:
                    score = tmp_score[1]
                if score is None:
                    special_page.append(f)
                    print("No score found in html:", f)
                    continue
                if len(score) > 3:
                    # longer strings are reduced to their digits only
                    score = ''.join(filter(str.isdigit, score.strip()))
                tmp_list.append(score)
                nvshen_list = nvshen_list + get_picture(content, tmp_list, nvshen_id_picture)
            except Exception:
                print("Failed to parse html:", f)
                raise
        return nvshen_list, special_page


    def get_picture(c, t_list, n_id_p):
        print("Entering get_picture:")
        nvshen_l = []
        # the link to the previous issue carries that goddess's page id
        tmp_prev_id = c.find_all('a', attrs={"target": "_self"})
        for j in tmp_prev_id:
            if j.string and '期' in j.string:
                href_list = j['href'].split('/')
                tmp_id = re.findall(r"\d+\.?\d*", href_list[-1])
                if len(tmp_id) == 1:
                    prev_nvshen_id = tmp_id[0]
                    t_list.append(prev_nvshen_id)
                    for n in n_id_p:
                        for k, v in n.items():
                            if str(k) == prev_nvshen_id:
                                t_list.append(v)  # thumbnail URL
                                print("t_list", t_list)
                                nvshen_l.append(t_list)
                                print("Leaving get_picture")
                                return nvshen_l
        # no matching thumbnail: return an empty list so the caller can still concatenate
        return nvshen_l
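To see what the two regular expressions in deal_loaclfile extract, here is a small standalone check. The sentence is made up in the shape the parser looks for (it must contain 上一期女神, a full-width comma after the name, and 超过…人 around the vote count); the name and the number are placeholders, not real page content.

```python
import re

# Made-up sentence; 某某 stands in for a name and 12000 for a vote count.
sentence = "上一期女神某某，共有超过12000人参与了评分。"

# Lazy match from "女神" up to the first full-width comma: the name.
name = "".join(re.findall(r"女神(.+?)，", sentence))
# Lazy match between "超过" and "人": the number of voters.
count = "".join(re.findall(r"超过(.+?)人", sentence))

print(name, count)
```

Because both patterns are lazy, each stops at the first delimiter it meets, so a page that words the sentence differently would yield an empty string rather than an error.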

Saving the data

We write the parsed data straight to a CSV file. If the data set were larger, saving it to MongoDB would also be worth considering.

    def save_to_file(nvshen_list, filename):
        with open(filename + '.csv', 'w', encoding='utf-8') as output:
            output.write('name,count,score,weight_score,page_id,picture\n')
            # row layout: [name, vote count, parsed page id, score, goddess page id, thumbnail URL]
            for row in nvshen_list:
                # weight: number of voters (digits only) divided by 1000
                weight = int(''.join(list(filter(str.isdigit, row[1])))) / 1000
                # weighted score: score plus weight, rounded to two decimals
                weight_2 = float(row[3]) + float('%.2f' % weight)
                weight_score = float('%.2f' % weight_2)
                rowcsv = '{},{},{},{},{},{}'.format(row[0], row[1], row[3], weight_score, row[4], row[5])
                output.write(rowcsv)
                output.write('\n')
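One caveat with joining fields by hand is that a name containing a comma would shift the columns. As an alternative sketch (not the author's code), the standard csv module quotes such fields automatically. The helper name write_rows and the sample row are hypothetical; the column names are the ones used in this article.

```python
import csv
import io

HEADER = ["name", "count", "score", "weight_score", "page_id", "picture"]


def write_rows(output, rows):
    """Write the header and the rows; csv.writer quotes fields that contain commas."""
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


# Made-up row whose name contains a comma, written to an in-memory buffer.
buffer = io.StringIO()
write_rows(buffer, [["某某, 测试", "12000", "95.3", 107.3, "1001", "https://example.com/a.jpg"]])
text = buffer.getvalue()
print(text)

# Reading it back shows the comma inside the name did not split the field.
parsed = list(csv.reader(io.StringIO(text)))
```

When writing to a real file, open it with open(filename + '.csv', 'w', encoding='utf-8', newline='') so the csv module controls the line endings.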

For each goddess's score, I also computed a weighted score based on the number of people who voted.
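The weighting in save_to_file amounts to adding one point per thousand voters to the score. A minimal standalone version, with a hypothetical function name and made-up inputs, looks like this:

```python
def weighted_score(score_text, count_text):
    """Return score + (number of voters / 1000), rounded to two decimals."""
    # Keep only the digits of the vote count, as save_to_file does.
    voters = int("".join(ch for ch in count_text if ch.isdigit()))
    weight = float("%.2f" % (voters / 1000))
    return float("%.2f" % (float(score_text) + weight))


# Made-up inputs: a score of 95.3 from 12000 voters.
example = weighted_score("95.3", "12000")
print(example)
```

Note that the weight is unbounded, so a goddess with very many voters can end up with a weighted score above the nominal maximum.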

Saving the pictures

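    import os

    import requests


    def save_pic(url, nick_name):
        resp = requests.get(url)
        # create the output folder on first use
        if not os.path.exists('picture'):
            os.mkdir('picture')
        # only write the file when the download succeeded
        if resp.status_code == 200:
            with open(os.path.join('picture', f'{nick_name}.jpg'), 'wb') as f:
                f.write(resp.content)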

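We download each image directly from the thumb URL obtained earlier and save it locally.

Making some charts

First, a bar chart showing the top 10 and the bottom 10.

As you can see, 朱茵 (Athena Chu), 石川恋 (Ren Ishikawa) and 高圆圆 (Gao Yuanyuan) take the top three places, and as many as 7 goddesses score 95 or higher. As for the bottom 10, see for yourself; does anyone find it a little painful? The vote counts also show that the more popular goddesses generally score well too.

That said, this ranking only reflects what football fans think. I wonder what a programmers' ranking would look like.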
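The article does not include the charting code. As a sketch of the selection step behind a top-10 and bottom-10 bar chart, the snippet below reads rows in the CSV layout written by save_to_file and sorts them. The rows are made up, the function name is hypothetical, and ranking by weight_score rather than score is an assumption.

```python
import csv
import io

# Made-up rows in the CSV layout used in this article; names and scores are placeholders.
SAMPLE_CSV = """name,count,score,weight_score,page_id,picture
A,12000,95.3,107.3,1001,https://example.com/a.jpg
B,3000,88.0,91.0,1002,https://example.com/b.jpg
C,500,90.0,90.5,1003,https://example.com/c.jpg
"""


def top_and_bottom(csv_text, n=10, key="weight_score"):
    """Return (top n rows, bottom n rows), both ordered from highest to lowest."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ranked = sorted(rows, key=lambda r: float(r[key]), reverse=True)
    return ranked[:n], ranked[-n:]


top, bottom = top_and_bottom(SAMPLE_CSV, n=2)
print([r["name"] for r in top], [r["name"] for r in bottom])
```

The two lists can then be handed to any plotting library to draw the bars.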

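Word cloud

Picture wall

Enough to make you drool.

Scoring with the Baidu API

Baidu offers a free face detection API: give it an image and it returns a score for the face. It is very convenient, and anyone interested can look it up on the official site.

Here are the new scores I obtained for the goddesses through the Baidu API:

Ha, the AI's score depends far too heavily on the particular picture, so treat it as pure entertainment.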
