
Simple Weibo Scraping and Analysis

July 21, 2020

Scrape Weibo comments and run a simple analysis on them.
Tasks:
1. Scrape comments and timestamps (requests and re)
2. Word-frequency statistics (jieba)
3. Word-cloud rendering (wordcloud)
4. Posting-time distribution (matplotlib)
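Step 1's regex extraction can be tried on a small snippet first. This is a minimal sketch; the HTML below is invented for illustration, and the real mobile-Weibo markup may differ:

```python
import re

# Made-up snippet mimicking the old mobile-Weibo comment markup.
html = ('<span class="ctt">great post</span>'
        '<span class="ct">06月12日 00:19&nbsp;from web</span>')

# Non-greedy groups grab the comment body and the timestamp text.
comments = re.findall('<span class="ctt">(.*?)</span>', html)
times = re.findall('<span class="ct">(.*?)&nbsp', html)
print(comments, times)
```

Note that the `class="ct"` pattern does not accidentally match `class="ctt"` spans, because the literal `">` must follow `ct`.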
The full code:

#heheyang
import requests
import re
import jieba
import wordcloud
import time as ti
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# data scraping
start_url = 'weibo_path&&page='
header = {'cookie': 'your cookie',
          'user-agent': 'your agent'}
f = open('path', 'w', encoding='utf-8')
#f.write('time\tcomment\n')
all_comment = ''
all_time = []
for i in range(1, 51):
    #print('scraping page %d....' % i)
    url = start_url + str(i)
    r = requests.get(url, timeout=30, headers=header)
    r.raise_for_status()
    r.encoding = 'utf-8'
    comment = re.findall('<span class="ctt">(.*?)</span>', r.text)
    time = re.findall('<span class="ct">(.*?)&nbsp', r.text)
    if i == 1:
        # the first four matches on page 1 are not comments, drop them
        del comment[0:4]
        del time[0:4]
    # zip pairs each comment with its timestamp (truncates to the shorter list)
    for t, c in zip(time, comment):
        f.write(t + '\t')
        f.write(c + '\n')
        all_comment += c
        all_time.append(t)
f.close()
#print(all_comment)
#print(all_time)
# comment word cloud
pattern = "[\u4e00-\u9fa5]+"  # keep Chinese characters only
regex = re.compile(pattern)
comment_chinese = regex.findall(all_comment)
text = ''.join(comment_chinese)
# word-frequency statistics
words = jieba.lcut(text)
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
words_counts = list(counts.items())
words_counts.sort(key=lambda x: x[1], reverse=True)
# formatted output of the top 10 words
tplt = "{0:^10}\t{1:^10}\t"
print(tplt.format("word", "count"))
for word, count in words_counts[:10]:
    print(tplt.format(word, count))
# word-cloud rendering
cut=' '.join(words)
w=wordcloud.WordCloud(font_path='msyh.ttc',collocations=False,height=600, width=1000,background_color='white')
w.generate(cut)
w.to_file('word.png')

# posting-time (popularity) analysis
time_list = []
for t in all_time:
    # e.g. '06月12日 00:19' -> '2020-06-12 00:19:00'
    t = t.replace('月', '-')
    t = t.replace('日', '')
    t = '2020-' + t + ':00'
    ts = int(ti.mktime(ti.strptime(t, "%Y-%m-%d %H:%M:%S")))  # convert to a Unix timestamp
    time_list.append(ts)
data = {'timeStamp': time_list}  # avoid shadowing the built-in dict
df = pd.DataFrame(data)
mean = df['timeStamp'].mean()
std = df['timeStamp'].std()
# plot a normal curve over the observed time window
# (np.linspace keeps the point count small; np.arange with step 0.1
# over this ~2e6-second range would allocate tens of millions of points)
x = np.linspace(1591889940, 1593856680, 2000)
y = np.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * np.sqrt(2 * np.pi))
plt.plot(x, y)
#plt.hist(df['timeStamp'], bins=12, rwidth=0.9, density=True)
plt.title('time distribution')
plt.xlabel('Time')
plt.ylabel('Attention to events')
plt.savefig('time distribution.png')
plt.show()
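The hand-rolled counting dict in the word-frequency step does the same job as `collections.Counter` from the standard library. A minimal sketch; the token list below is made up, standing in for the output of `jieba.lcut(text)`:

```python
from collections import Counter

# Made-up tokens standing in for jieba.lcut(text)
words = ['微博', '评论', '微博', '分析', '评论', '微博']

# Counter builds the {word: count} mapping in one call,
# and most_common(n) returns the top n pairs sorted by count.
counts = Counter(words)
top = counts.most_common(2)
print(top)  # → [('微博', 3), ('评论', 2)]
```

`most_common` replaces the manual `list(counts.items())` plus `sort(...)` combination.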

I haven't written many crawlers and am still a beginner learning; feedback and discussion are welcome…

Original article: https://blog.csdn.net/heheyangxyy/article/details/107475931
