import chardet
import codecs
import re

# Detect the file's encoding before reading it
file_path = '/home/ricardo/out/news_sohusite_xml.dat'
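The chardet import is only useful if we actually call it; a minimal detection step, as a sketch (a few KB of raw bytes is usually enough for a confident guess):

with open(file_path, 'rb') as raw:
    sample = raw.read(10000)     # a raw byte sample for detection
print(chardet.detect(sample))    # expected to report GB2312/GBK here, matching the encoding used below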
# Read the file as GB2312, skipping undecodable bytes
f2 = codecs.open(file_path, encoding='GB2312', errors='ignore')
content2 = f2.read()
f2.close()
# Write the extracted text to a UTF-8 file
f = codecs.open('/home/ricardo/out/news_sohusite_xml.txt', 'w', encoding='utf8')
# Extract the text between <content> and </content>; the non-greedy
# pattern keeps multiple contents on one line from being merged
a = re.findall('<content>(.*?)</content>', content2)
print("Length of list: %d" % len(a))
for i, item in enumerate(a, 1):
    f.write(item + '\n')
    if i % 1000 == 0:
        print("index: %d / %d" % (i, len(a)))
f.close()
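As a quick sanity check of the extraction (a sketch; the exact articles depend on the dataset dump you downloaded):

with codecs.open('/home/ricardo/out/news_sohusite_xml.txt', encoding='utf8') as chk:
    for _ in range(3):
        print(chk.readline().strip()[:50])   # peek at the first few extracted articles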
import jieba
jieba.enable_parallel()  # parallel segmentation; not supported on Windows

# Build the stop-word set (a set makes the membership test O(1))
def stopwordslist():
    with open('/home/ricardo/stopwords/hit_stopwords.txt', encoding='UTF-8') as fstop:
        return set(line.strip() for line in fstop)

# Load the stop words once instead of re-reading the file for every line
stopwords = stopwordslist()

# Segment one line of Chinese text with jieba and drop the stop words
def seg_depart(sentence):
    sentence_depart = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_depart:
        if word not in stopwords and word != '\t':
            outstr += word + ' '
    return outstr
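Before running it over the whole corpus, a quick smoke test (illustrative only: the tokens depend on jieba's dictionary and on which words the stop-word file contains):

print(seg_depart('我爱北京天安门'))   # e.g. "爱 北京 天安门 " if 我 is a stop word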
# Input and output paths
filename = "/home/ricardo/out/1.txt"
outfilename = "/home/ricardo/outout.txt"
inputs = open(filename, 'r', encoding='UTF-8')
outputs = open(outfilename, 'w', encoding='UTF-8')
# Segment each line and write the result to the output file
print("Segmenting...")
for line in inputs:
    line_seg = seg_depart(line)
    outputs.write(line_seg + '\n')
outputs.close()
inputs.close()
from gensim.models import word2vec
import multiprocessing

# Train a Word2Vec model; gensim >= 4.0 names the dimension parameter
# vector_size (it was size in gensim 3.x)
def train_wordVectors(sentences, embedding_size=128, window=5, min_count=5):
    w2vModel = word2vec.Word2Vec(sentences, vector_size=embedding_size, window=window,
                                 min_count=min_count, workers=multiprocessing.cpu_count())
    return w2vModel

def save_wordVectors(w2vModel, word2vec_path):
    w2vModel.save(word2vec_path)

def load_wordVectors(word2vec_path):
    return word2vec.Word2Vec.load(word2vec_path)
if __name__ == '__main__':
    # With a single input file, read it with LineSentence
    sentences = word2vec.LineSentence('/home/ricardo/out.txt')
    # With multiple files, read the directory with PathLineSentences:
    # segment_dir = '/words/'
    # sentences = word2vec.PathLineSentences(segment_dir)

    # For ordinary training, setting these few parameters is enough:
    word2vec_path = '/home/ricardo/word2Vec.model'
    model2 = train_wordVectors(sentences, embedding_size=128, window=5, min_count=5)
    save_wordVectors(model2, word2vec_path)
    model2 = load_wordVectors(word2vec_path)
    print(model2.wv.similarity('你好', '您好'))
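Other common queries work the same way once the model is loaded, e.g. nearest neighbours and raw vectors (a sketch; the actual neighbours depend on the training corpus):

    # hypothetical follow-up queries on the trained model
    print(model2.wv.most_similar('你好', topn=5))   # nearest words by cosine similarity
    print(model2.wv['你好'].shape)                  # (128,) with embedding_size=128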