
sklearn + nltk: Sentiment Analysis (Positive vs. Negative)

August 25, 2019 | 移动技术网 IT编程


Reposted from:

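Steps:

1. Labelled data. Positive reviews are in pos_text.txt and negative reviews are in neg_text.txt.

2. Feature construction. Features can be single words or two-word collocations (bigrams). For example, "手机非常", "非常好用" and "好用 !" are three bigrams that can serve as classification features. Three-word collocations (trigrams), four-word collocations and so on can be used in the same way.

3. Feature reduction. Use a statistical measure to find the most informative features. Candidates include term frequency, document frequency, pointwise mutual information, information entropy and the chi-square statistic.

4. Feature representation. nltk expects each sample in the form [ {"feature 1": True, "feature 2": True, "feature n": True}, class label ].

5. Build the classifier and predict. Once the best algorithm has been chosen, the number of features can be tuned to see how it affects accuracy. (1) Train the algorithm on the training set to obtain a classifier. (2) Use the classifier to label the dev-test set. (3) Compare the predicted labels with the hand-annotated labels to obtain the classifier's accuracy.

Here nltk mainly handles feature extraction (bigrams and longer collocations are built with nltk) and feature selection (using the statistical measures nltk provides). scikit-learn mainly handles the classification algorithms, the evaluation of the results and the predictions themselves.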
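Steps 2 and 4 can be sketched in a few lines of plain Python. This is only an illustration: the helper names `ngrams_of` and `to_featureset` and the toy token list are made up here and are not part of nltk or of the original post.

```python
def ngrams_of(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def to_featureset(tokens, label, max_n=2):
    """Build [ {feature: True, ...}, label ] from unigrams up to max_n-grams."""
    features = {}
    for n in range(1, max_n + 1):
        for gram in ngrams_of(tokens, n):
            features[' '.join(gram)] = True
    return [features, label]


# Hypothetical, already-tokenised review
tokens = ['phone', 'very', 'easy-to-use', '!']
sample = to_featureset(tokens, 'pos')
print(ngrams_of(tokens, 2))
print(sample)
```

Raising `max_n` to 3 or 4 adds trigrams or four-word collocations to the same dictionary, at the cost of a much larger feature space, which is why step 3 follows.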

Experiment:

1. Load the data. The function returns all positive and negative reviews joined into a single str.

def text():
    # Read both review files and return their contents as one string.
    # The files are opened with the platform's default encoding, as in the original code.
    with open('pos_text.txt', 'r') as f1, open('neg_text.txt', 'r') as f2:
        return f1.read() + f2.read()

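2. Build the features

# Use single tokens as features
def bag_of_words(words):
    d = {}
    for word in words:
        d[word] = True
    return d

# text() returns a str, so iterating over it yields single characters
print(bag_of_words(text()[:5]))

{'除': True, '了': True, '电': True, '池': True, '不': True}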
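The keys in the output above are single characters because a str is iterated character by character. The contrast with a token list can be sketched as follows; the whitespace split is only a stand-in for a real tokeniser, and the sample string is invented for the illustration.

```python
def bag_of_words(words):
    """Map every item of an iterable to True."""
    return {word: True for word in words}


raw = 'battery weak'   # a plain string
tokens = raw.split()   # stand-in tokeniser: split on whitespace

char_features = bag_of_words(raw)     # keys are single characters
word_features = bag_of_words(tokens)  # keys are whole words

print(sorted(char_features))
print(sorted(word_features))
```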

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Use bigrams as features, keeping the top 1000 ranked by the chi-square statistic
def bigram(words, score_fn=BigramAssocMeasures.chi_sq, n=1000):
    bigram_finder = BigramCollocationFinder.from_words(words)  # collect adjacent pairs from the sequence
    bigrams = bigram_finder.nbest(score_fn, n)  # keep the n highest-scoring pairs
    newbigrams = [u + v for (u, v) in bigrams]  # bigrams is just a list of (u, v) pairs; join each pair into one string
    return bag_of_words(newbigrams)  # convert to a {bigram: True} dict

print(bigram(text()[:5], score_fn=BigramAssocMeasures.chi_sq, n=1000))

{'了电': True, '池不': True, '电池': True, '除了': True}
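The call nbest(score_fn, n) ranks the candidate bigrams by a score and keeps the top n. The same idea can be sketched without nltk by ranking on raw frequency instead of chi-square; this is a deliberately simpler measure swapped in for illustration, and `top_bigrams` and the sample tokens are hypothetical.

```python
from collections import Counter


def top_bigrams(tokens, n):
    """Return the n most frequent adjacent token pairs, joined into strings."""
    counts = Counter(zip(tokens, tokens[1:]))
    return [u + ' ' + v for (u, v), _ in counts.most_common(n)]


tokens = ['very', 'good', 'screen', 'very', 'good', 'battery', 'very', 'weak']
best = top_bigrams(tokens, 1)
features = {bigram: True for bigram in best}
print(best)
print(features)
```

Raw frequency favours pairs made of common words, whereas chi-square rewards pairs that co-occur more often than their individual frequencies would suggest, so the two rankings generally differ.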

# Use single tokens and bigrams together as features
def bigram_words(words, score_fn=BigramAssocMeasures.chi_sq, n=1000):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    newbigrams = [u + v for (u, v) in bigrams]

    word_dict = bag_of_words(words)  # dict of single tokens
    bigrams_dict = bag_of_words(newbigrams)  # dict of bigrams
    word_dict.update(bigrams_dict)  # merge bigrams_dict into word_dict
    return word_dict

print(bigram_words(text()[:10], score_fn=BigramAssocMeasures.chi_sq, n=1000))

{'除': True, '了': True, '电': True, '池': True, '不': True, '给': True, '力': True, ' ': True, '都': True, '很': True, ' 都': True, '不给': True, 
'了电': True, '力 ': True, '池不': True, '电池': True, '给力': True, '都很': True, '除了': True}

import jieba
# Use jieba word segmentation for the features

def read_file(filename):
    with open('stopword.txt', 'r', encoding='utf-8') as sf:
        stop = [line.strip() for line in sf]  # stop words
    reviews = []
    with open(filename, 'r') as f:
        for line in f:
            s = line.split('\t')  # take the text before the first tab; a line without tabs keeps its trailing '\n'
            fenci = jieba.cut(s[0], HMM=True)  # accurate mode (cut_all defaults to False); HMM=True enables new-word discovery
            reviews.append(list(set(fenci) - set(stop)))  # unique tokens minus stop words
    return reviews  # one token list per review

print(read_file('pos_text.txt')[:2])

[['真的', '大屏', '僵尸', '好', '敢', '出色', '300', '14', '多买块', '都', '大战', '双核', '秒杀', '帮', '苹果', '一点', '一张', 'g11', '分辨率', '入手',
 '地方', '植物', '会', '\n', '值得', '16g', '请', '电池', '不给力', '才', '4', '2820', '拿下', '综合', '盖', '感觉', '回答', 'c6', '选', '10', '留言', '玩机',
 '大屏幕', '一起', '放', '照相', '3', '不在乎', '哥', '不是', '非常', '画面', '水果', '游戏', '本来', '再', '贵', '机子', '朋友', '之间', '果断', ' ', '不敢',
 'g14', '一会', '咨询', '差', '判定', '觉得', '小', '尽量', '办公室', '想', '高', '多天', '开后', '心动', '打算', '不会', '极品飞车', '纠结', '买', '玩', '很无语',
 '不到', '安徽', '阜阳', '老婆', '很', '块', '卡', '俩', '差不多', '价格', '带', '500w'], ['9', '希望', '能够', '电池', '很漂亮', '很棒', '屏幕', '不错', '寸',
 '几乎', '完机', '高', '性价比', '4', 'sense', '运行', '都', '值得', '现在', '烫手', '16', '2.3', '一点', '4.3', '长时间', '霸气', '行', '软件', '解决', '入手',
 '很', '确实', '瑕不掩瑜', '其实', ' ', '流畅', '兼容', '还', '3.0', '问题', '真机', '整体', '清晰', '机无', '\n']]
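Because read_file builds each review with set(fenci) - set(stop), every token appears at most once per review and the order is arbitrary, which is why the lists above look shuffled. If token order matters, for example to build bigrams from segmented words, an order-preserving filter along these lines could be used instead. The function `filter_tokens` and the sample data are hypothetical.

```python
def filter_tokens(tokens, stopwords):
    """Drop stop words and repeats while keeping first-occurrence order."""
    seen = set()
    kept = []
    for tok in tokens:
        if tok in stopwords or tok in seen:
            continue
        seen.add(tok)
        kept.append(tok)
    return kept


stop = {'the', 'is'}  # hypothetical stop-word set
tokens = ['the', 'screen', 'is', 'clear', 'screen', 'bright']
result = filter_tokens(tokens, stop)
print(result)
```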

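from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.metrics import BigramAssocMeasures

# Select the `number` most informative features, scored by the chi-square statistic

def jieba_feature(number):
    poswords = []
    negwords = []
    for items in read_file('pos_text.txt'):  # flatten the list of token lists into one list
        for item in items:
            poswords.append(item)
    for items in read_file('neg_text.txt'):
        for item in items:
            negwords.append(item)

    word_fd = FreqDist()  # frequency of every word over all reviews
    # In a FreqDist the keys are words and the values are their total counts. The constructor
    # also accepts any list and tallies its repeated items; here the counts are accumulated
    # one word at a time in the loops below.

    cond_word_fd = ConditionalFreqDist()  # word frequencies within positive and within negative reviews
    # A conditional frequency distribution is a collection of frequency distributions, one per
    # condition; the condition is usually the category of the text.
    # It works on (condition, event) pairs; here the condition is the sentiment label and the event is the word.
    # Methods:
    # conditions(): return the list of conditions
    # tabulate(conditions, samples): print a table for the given conditions and samples
    # plot(conditions, samples): plot the distribution for the given conditions and samples

    for word in poswords:
        word_fd[word] += 1
        cond_word_fd['pos'][word] += 1

    for word in negwords:
        word_fd[word] += 1
        cond_word_fd['neg'][word] += 1

    pos_word_count = cond_word_fd['pos'].N()  # number of tokens counted in positive reviews
    neg_word_count = cond_word_fd['neg'].N()  # number of tokens counted in negative reviews
    total_word_count = pos_word_count + neg_word_count

    word_scores = {}  # maps each word to its information score

    for word, freq in word_fd.items():  # word_fd = {'word': count}
        pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word], (freq, pos_word_count), total_word_count)
        # Chi-square score of the word against the positive class; mutual information or another measure could be used instead.
        # The chi-square value measures how strongly two variables are associated: the larger the value, the stronger the association.

        neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word], (freq, neg_word_count), total_word_count)

        word_scores[word] = pos_score + neg_score  # a word's score is its positive chi-square plus its negative chi-square

    best_vals = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)[:number]  # sort by score, highest first; number is the feature dimension and can be tuned
    best_words = set([w for w, s in best_vals])

    return dict([(word, True) for word in best_words])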
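To make the scoring step concrete, the chi-square statistic for a 2x2 word-by-class table can be worked out by hand. The sketch below takes its arguments in the same order as the chi_sq calls above (count in class, word total, class total, grand total); that it reproduces nltk's value exactly is an assumption, and the counts are invented for the example.

```python
def chi_square(n_ii, n_ix, n_xi, n_xx):
    """Chi-square for a 2x2 table given as one cell count plus marginals.

    n_ii: occurrences of the word in the class
    n_ix: occurrences of the word in all classes
    n_xi: total tokens in the class
    n_xx: total tokens overall
    """
    a = n_ii                       # word, in class
    b = n_ix - n_ii                # word, other class
    c = n_xi - n_ii                # other words, in class
    d = n_xx - n_ix - n_xi + n_ii  # other words, other class
    numerator = n_xx * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Hypothetical counts: the word occurs 30 times among 100 positive tokens
# and 10 times among 100 negative tokens.
pos_score = chi_square(30, 40, 100, 200)
neg_score = chi_square(10, 40, 100, 200)
print(pos_score, neg_score)
```

With exactly two classes the positive and negative tables are mirror images of each other, so the two scores coincide; adding them doubles every word's score without changing the ranking.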

# Switch between the four feature-selection methods here and compare the results

def build_features():
    feature = bag_of_words(text())  # method 1: single words (single characters, since text() is a str)
    #feature = bigram(text(), score_fn=BigramAssocMeasures.chi_sq, n=500)  # method 2: bigrams
    #feature = bigram_words(text(), score_fn=BigramAssocMeasures.chi_sq, n=500)  # method 3: single words plus bigrams
    #feature = jieba_feature(300)  # method 4: jieba segmentation

    posfeatures = []
    for items in read_file('pos_text.txt'):
        a = {}
        for item in items:  # items is the token list of one review; item is one token
            if item in feature.keys():
                a[item] = True
        poswords = [a, 'pos']  # label positive reviews 'pos'
        posfeatures.append(poswords)

    negfeatures = []
    for items in read_file('neg_text.txt'):
        a = {}
        for item in items:
            if item in feature.keys():
                a[item] = True
        negwords = [a, 'neg']  # label negative reviews 'neg'
        negfeatures.append(negwords)

    return posfeatures, negfeatures

# Build the training data

posfeatures, negfeatures = build_features()

from random import shuffle
import sklearn
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

shuffle(posfeatures)
shuffle(negfeatures)  # randomise the order of the reviews

train = posfeatures[300:] + negfeatures[300:]  # training set (70%)
test = posfeatures[:300] + negfeatures[:300]  # validation set (30%)

data, tag = zip(*test)  # separate the test features from their labels for evaluation

def score(classifier):
    classifier = SklearnClassifier(classifier)
    classifier.train(train)  # train the classifier
    pred = classifier.classify_many(data)  # predicted labels
    n = 0
    s = len(pred)
    for i in range(0, s):
        if pred[i] == tag[i]:
            n = n + 1
    return n / s  # accuracy of the classifier

print("BernoulliNB's accuracy is %f" % score(BernoulliNB()))
print("MultinomialNB's accuracy is %f" % score(MultinomialNB()))
print("LogisticRegression's accuracy is %f" % score(LogisticRegression(solver='lbfgs')))
print("SVC's accuracy is %f" % score(SVC(gamma='scale')))
print("LinearSVC's accuracy is %f" % score(LinearSVC()))
#print("NuSVC's accuracy is %f" % score(NuSVC()))
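The train-and-score loop can also be sketched with scikit-learn alone, which may help when checking what the wrapper does with the feature dicts. This sketch assumes DictVectorizer for the dict-to-matrix conversion; the toy featuresets are invented and far too small to say anything about real accuracy.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy featuresets in the [ {feature: True}, label ] form
train = [
    [{'good': True, 'smooth': True}, 'pos'],
    [{'good': True, 'clear': True}, 'pos'],
    [{'bad': True, 'laggy': True}, 'neg'],
    [{'bad': True, 'hot': True}, 'neg'],
]
test = [[{'good': True}, 'pos'], [{'bad': True}, 'neg']]

train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

vectorizer = DictVectorizer()  # turns feature dicts into a sparse 0/1 matrix
clf = BernoulliNB()
clf.fit(vectorizer.fit_transform(train_x), train_y)

pred = clf.predict(vectorizer.transform(test_x))
accuracy = sum(p == t for p, t in zip(pred, test_y)) / len(test_y)
print(list(pred), accuracy)
```

The vectorizer is fitted on the training dicts only and then reused on the test dicts, so features that never occurred in training are silently ignored at prediction time.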

3. Results

# Method 4: jieba segmentation
# BernoulliNB's accuracy is 0.858333
# MultinomialNB's accuracy is 0.871667 (best)
# LogisticRegression's accuracy is 0.820000
# SVC's accuracy is 0.805000
# LinearSVC's accuracy is 0.795000

# Method 3: single words plus bigrams
# BernoulliNB's accuracy is 0.761667 (best)
# MultinomialNB's accuracy is 0.701667
# LogisticRegression's accuracy is 0.756667
# SVC's accuracy is 0.688333
# LinearSVC's accuracy is 0.733333

# Method 2: bigrams
# BernoulliNB's accuracy is 0.773333 (best)
# MultinomialNB's accuracy is 0.688333
# LogisticRegression's accuracy is 0.726667
# SVC's accuracy is 0.661667
# LinearSVC's accuracy is 0.726667

# Method 1: single words
# BernoulliNB's accuracy is 0.641667
# MultinomialNB's accuracy is 0.616667
# LogisticRegression's accuracy is 0.668333 (best)
# SVC's accuracy is 0.545000
# LinearSVC's accuracy is 0.653333
