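Yesterday I needed to count word frequencies in Chinese text. After some searching on Baidu I found plenty of clickbait articles whose content did not match their titles, so here is a short record of how I implemented it.
Unlike word-frequency counting for English, the hard part with Chinese is segmentation: written Chinese has no spaces between words, so the text must be split into words before anything can be counted. Fortunately there are many good ready-made libraries for this; here I used ansj_seg.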
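To make the difficulty concrete, here is a minimal sketch (the class and method names are my own, chosen only for illustration) that tokenizes by splitting on whitespace, which is roughly what a naive English word counter does. The English sentence falls apart into words, while the Chinese sentence comes back as a single token:

```java
final class WhitespaceSplitDemo {

    /** Counts the tokens produced by splitting the text on runs of whitespace. */
    static int countTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // English words are separated by spaces, so splitting works.
        System.out.println(countTokens("the quick brown fox")); // prints 4
        // Chinese has no spaces, so the whole sentence is one token.
        System.out.println(countTokens("欢迎使用中文分词"));       // prints 1
    }
}
```

This is why a dedicated segmenter is needed before the counting step.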
First, add the dependency:
<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.1.1</version>
</dependency>
Basic usage:
String str = "欢迎使用ansj_seg,(ansj中文分词)在这里如果你遇到什么问题都可以联系我.我一定尽我所能.帮助大家.ansj_seg更快,更准,更自由!";
System.out.println(ToAnalysis.parse(str));

Output:
欢迎/v,使用/v,ansj/en,_,seg/en,,,(,ansj/en,中文/nz,分词/n,),在/p,这里/r,如果/c,你/r,遇到/v,什么/r,问题/n,都/d,可以/v,联系/v,我/r,./m,我/r,一定/d,尽我所能/l,./m,帮助/v,大家/r,./m,ansj/en,_,seg/en,更快/d,,,更/d,准/a,,,更/d,自由/a,!
Here is the code:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.ansj.splitWord.analysis.ToAnalysis;

public class WordFrequency {

    public static void wordFrequency() throws IOException {
        Map<String, Integer> map = new HashMap<>();
        String article = getString();
        // Segment the article and drop the part-of-speech tags
        String result = ToAnalysis.parse(article).toStringWithOutNature();
        String[] words = result.split(",");
        for (String word : words) {
            String str = word.trim();
            // Skip empty tokens
            if (str.equals("")) continue;
            // Skip some high-frequency punctuation marks
            else if (str.matches("[().,。+\\-“”:?\\s]")) continue;
            // Skip tokens of length 1
            else if (str.length() < 2) continue;
            // Count the trimmed token (the original keyed the map on the untrimmed word)
            if (!map.containsKey(str)) {
                map.put(str, 1);
            } else {
                map.put(str, map.get(str) + 1);
            }
        }

        // Print the raw counts
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }

        // Repeatedly pull out the most frequent entry to get a list sorted by count, descending
        List<Map.Entry<String, Integer>> list = new ArrayList<>();
        Map.Entry<String, Integer> entry;
        while ((entry = getMax(map)) != null) {
            list.add(entry);
        }
        System.out.println(Arrays.toString(list.toArray()));
    }

    /**
     * Finds the entry with the largest value in the map, removes it from the map, and returns it.
     * @param map word counts; modified by this call
     * @return the entry with the largest value, or null if the map is empty
     */
    public static Map.Entry<String, Integer> getMax(Map<String, Integer> map) {
        if (map.isEmpty()) {
            return null;
        }
        Map.Entry<String, Integer> maxEntry = null;
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            if (maxEntry == null || entry.getValue() > maxEntry.getValue()) {
                maxEntry = entry;
            }
        }
        map.remove(maxEntry.getKey());
        return maxEntry;
    }

    /**
     * Reads the article to be segmented from a file.
     * The file content comes from a popular Jianshu article: https://www.jianshu.com/p/5b37403f6ba6
     * @return the article text
     * @throws IOException if the file cannot be read
     */
    public static String getString() throws IOException {
        File file = new File("/home/as_/ideaprojects/springmaven/article-txt");
        // try-with-resources closes the reader even on failure; assumes the file is UTF-8 encoded
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            StringBuilder strBuilder = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                strBuilder.append(line);
            }
            return strBuilder.toString();
        }
    }
}
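The repeated find-the-maximum loop above rescans the whole map once per entry and empties the map as it goes. As a possible alternative, here is a minimal sketch that counts with `Map.merge` and sorts a copy of the entries once. It is not the code I actually ran: the class name `WordFrequencySketch` and the sample input are made up for illustration, and it assumes the text has already been segmented into a comma-separated string, so it does not call ansj_seg itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordFrequencySketch {

    /** Counts tokens of length >= 2 in a comma-separated, already segmented string. */
    static Map<String, Integer> count(String segmented) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : segmented.split(",")) {
            String word = token.trim();
            // Same filter as the original: drop empty and single-character tokens
            if (word.length() < 2) continue;
            // merge() inserts 1 for a new word, otherwise adds 1 to the existing count
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the entries sorted by count, descending; ties are ordered by word. */
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical segmenter output, used only as sample input
        String segmented = "中文,分词,中文,的,统计,分词,中文";
        System.out.println(sortByCount(count(segmented))); // prints [中文=3, 分词=2, 统计=1]
    }
}
```

Sorting a copy leaves the counts map intact, so it can still be used after the ranking is printed.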
Finally, as usual, here is a screenshot: