
ML.NET Machine Learning Framework Study Notes [3]: Text Feature Analysis

May 31, 2019 | 移动技术网 IT编程

1. The Problem to Solve

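Problem: organizations often need meeting minutes to be entered after a meeting is held, and we want machine learning to judge automatically whether the text a user enters is acceptable or unacceptable. (Spam SMS detection and work-log quality analysis are problems of the same kind.)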

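Approach: we manually review existing meeting minutes and label each one as acceptable or unacceptable, then train a model on these labeled records. The learning algorithm is still the fast decision tree (FastTree) binary classifier. Unlike the previous article, the input feature this time is not a set of floating-point numbers but Chinese text, which brings in text feature extraction.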

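Why is text feature extraction needed? Text is human language, and a sequence of characters cannot be passed to a learning algorithm directly. The algorithm accepts only fixed-length numeric feature vectors (float values or float arrays) and cannot interpret variable-length text documents.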
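To make the idea of a fixed-length vector concrete, here is a minimal bag-of-words sketch in plain C#. It is only an illustration and not what ML.NET does internally: the class name BagOfWordsSketch, its methods, and the three sample sentences are all made up for this example, and the input is assumed to be already split into words by spaces.

```csharp
using System;
using System.Collections.Generic;

public static class BagOfWordsSketch
{
    // Build a vocabulary (word -> column index) from space-separated documents.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> documents)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (var doc in documents)
        {
            foreach (var word in doc.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!vocabulary.ContainsKey(word))
                {
                    // The next free column index equals the current vocabulary size.
                    vocabulary.Add(word, vocabulary.Count);
                }
            }
        }
        return vocabulary;
    }

    // Map one document to a vector of word counts whose length is the vocabulary size.
    public static float[] Vectorize(string document, Dictionary<string, int> vocabulary)
    {
        var vector = new float[vocabulary.Count];
        foreach (var word in document.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Words outside the vocabulary are ignored.
            if (vocabulary.TryGetValue(word, out int index))
            {
                vector[index] += 1f;
            }
        }
        return vector;
    }

    public static void Main()
    {
        // Hypothetical, already segmented sample sentences.
        var docs = new[] { "召开 支委会", "开展 党员 答题 活动", "党员 学习 活动" };
        var vocabulary = BuildVocabulary(docs);
        foreach (var doc in docs)
        {
            Console.WriteLine($"{doc} -> [{string.Join(", ", Vectorize(doc, vocabulary))}]");
        }
    }
}
```

Every sentence, long or short, becomes a vector with one slot per vocabulary word, which is the shape a trainer such as FastTree can consume.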

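Commonly used text feature extraction methods include the following:

[Figure: list of common text feature extraction methods]

You only need a rough understanding of what these methods mean. We do not have to implement a text feature extraction algorithm ourselves; the method that ships with the platform is enough.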

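The built-in text featurization method takes a string as input and requires the words in a sentence to be separated by spaces. English words are naturally separated by spaces, but Chinese words are not, so we must first segment the text into words. The workflow is as follows:

[Figure: processing workflow]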
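As a rough picture of what word segmentation produces, the sketch below uses forward maximum matching against a tiny word list. This is a deliberately simple stand-in and not the algorithm jieba uses; the class name SegmentSketch, the four-word lexicon, and the maximum word length are assumptions made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class SegmentSketch
{
    // Hypothetical toy lexicon; a real segmenter ships a far larger dictionary.
    static readonly HashSet<string> Lexicon = new HashSet<string>
    {
        "开展", "党员", "答题", "活动"
    };

    const int MaxWordLength = 4;

    // Forward maximum matching: at each position take the longest lexicon word,
    // falling back to a single character when nothing matches.
    public static List<string> Cut(string text)
    {
        var words = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int length = Math.Min(MaxWordLength, text.Length - pos);
            while (length > 1 && !Lexicon.Contains(text.Substring(pos, length)))
            {
                length--;
            }
            words.Add(text.Substring(pos, length));
            pos += length;
        }
        return words;
    }

    public static void Main()
    {
        // Join the words with spaces, the input shape the featurizer expects.
        Console.WriteLine(string.Join(" ", Cut("开展党员答题活动。")));
    }
}
```

Joining the resulting words with a single space gives Chinese text the same shape as an English sentence, which is what the segmentation step in this article does with jieba.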

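2. Code

The overall flow is essentially the same as in the previous article. For brevity, saving and loading the model are omitted.

First, a look at the dataset:

[Figure: sample rows of the dataset]

The code is as follows: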

using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace BinaryClassification_TextFeaturize
{
    class Program
    {
        static readonly string DataPath = Path.Combine(Environment.CurrentDirectory, "data", "meeting_data_full.csv");

        static void Main(string[] args)
        {
            MLContext mlContext = new MLContext();
            // Load the labeled records and hold out 15% of them for testing.
            var fullData = mlContext.Data.LoadFromTextFile<MeetingInfo>(DataPath, separatorChar: ',', hasHeader: false);
            var trainTestData = mlContext.Data.TrainTestSplit(fullData, testFraction: 0.15);
            var trainData = trainTestData.TrainSet;
            var testData = trainTestData.TestSet;

            // Pipeline: word segmentation -> text featurization -> FastTree binary classifier.
            var trainingPipeline = mlContext.Transforms.CustomMapping<JiebaLambdaInput, JiebaLambdaOutput>(mapAction: JiebaLambda.MyAction, contractName: "JiebaLambda")
                .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: "JiebaText"))
                .Append(mlContext.BinaryClassification.Trainers.FastTree(labelColumnName: "Label", featureColumnName: "Features"));
            ITransformer trainedModel = trainingPipeline.Fit(trainData);

            // Evaluate on the held-out test set.
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.BinaryClassification.Evaluate(data: predictions, labelColumnName: "Label");
            Console.WriteLine($"Evaluation Accuracy: {metrics.Accuracy:P2}");

            // Create the prediction engine.
            var predEngine = mlContext.Model.CreatePredictionEngine<MeetingInfo, PredictionResult>(trainedModel);

            // Prediction 1
            MeetingInfo sampleStatement1 = new MeetingInfo { Text = "支委会。" };
            var predictionResult1 = predEngine.Predict(sampleStatement1);
            Console.WriteLine($"{sampleStatement1.Text}:{predictionResult1.PredictedLabel}");

            // Prediction 2
            MeetingInfo sampleStatement2 = new MeetingInfo { Text = "开展新时代中国特色社会主义思想三十讲党员答题活动。" };
            var predictionResult2 = predEngine.Predict(sampleStatement2);
            Console.WriteLine($"{sampleStatement2.Text}:{predictionResult2.PredictedLabel}");

            Console.WriteLine("Press any key to exit!");
            Console.ReadKey();
        }
    }

    public class MeetingInfo
    {
        [LoadColumn(0)]
        public bool Label { get; set; }
        [LoadColumn(1)]
        public string Text { get; set; }
    }

    // Member names must match the column names produced by the pipeline.
    public class PredictionResult : MeetingInfo
    {
        public string JiebaText { get; set; }
        public float[] Features { get; set; }
        public bool PredictedLabel;
        public float Score;
        public float Probability;
    }
}

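3. Code Analysis

I will not repeat what was already explained in the previous article. The focus here is on how the learning pipeline is built.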

var trainingPipeline = mlContext.Transforms.CustomMapping<JiebaLambdaInput, JiebaLambdaOutput>(mapAction: JiebaLambda.MyAction, contractName: "JiebaLambda")
    .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: "JiebaText"))
    .Append(mlContext.BinaryClassification.Trainers.FastTree(labelColumnName: "Label", featureColumnName: "Features"));

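Before the text can be featurized, it has to be segmented into words. One option is to preprocess the sample data and train on the already segmented result. We did not take that route. Instead, we defined a custom data processing step that performs the segmentation inside the pipeline. It is defined as follows: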

namespace BinaryClassification_TextFeaturize
{
    // Input of the custom mapping: the raw text column.
    public class JiebaLambdaInput
    {
        public string Text { get; set; }
    }

    // Output of the custom mapping: the space-separated segmented text.
    public class JiebaLambdaOutput
    {
        public string JiebaText { get; set; }
    }

    public class JiebaLambda
    {
        public static void MyAction(JiebaLambdaInput input, JiebaLambdaOutput output)
        {
            // Segment the sentence with jieba and join the words with spaces.
            JiebaNet.Segmenter.JiebaSegmenter jiebaSegmenter = new JiebaNet.Segmenter.JiebaSegmenter();
            output.JiebaText = string.Join(" ", jiebaSegmenter.Cut(input.Text));
        }
    }
}

Finally, we create two objects and run actual predictions on them:

            // Prediction 1
            MeetingInfo sampleStatement1 = new MeetingInfo { Text = "支委会。" };
            var predictionResult1 = predEngine.Predict(sampleStatement1);
            Console.WriteLine($"{sampleStatement1.Text}:{predictionResult1.PredictedLabel}");

            // Prediction 2
            MeetingInfo sampleStatement2 = new MeetingInfo { Text = "开展新时代中国特色社会主义思想三十讲党员答题活动。" };
            var predictionResult2 = predEngine.Predict(sampleStatement2);
            Console.WriteLine($"{sampleStatement2.Text}:{predictionResult2.PredictedLabel}");

The prediction results are as follows:

[Figure: console output of the two predictions]

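4. Debugging

As mentioned in the previous article, calling the Transform method converts every record. To see what the transformed dataset looks like, we can write a small debugging routine.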

        var predictions = trainedModel.Transform(testData);
        DebugData(mlContext, predictions);

        // Requires "using System.Collections.Generic;" for List<T>.
        private static void DebugData(MLContext mlContext, IDataView predictions)
        {
            // ignoreMissingColumns: true tolerates members that have no matching column.
            var trainDataShow = new List<PredictionResult>(mlContext.Data.CreateEnumerable<PredictionResult>(predictions, reuseRowObject: false, ignoreMissingColumns: true));

            foreach (var dataLine in trainDataShow)
            {
                dataLine.PrintToConsole();
            }
        }

    public class PredictionResult
    {
        public string JiebaText { get; set; }
        public float[] Features { get; set; }
        public bool PredictedLabel;
        public float Score;
        public float Probability;

        public void PrintToConsole()
        {
            Console.WriteLine($"JiebaText={JiebaText}");
            Console.WriteLine($"PredictedLabel:{PredictedLabel},Score:{Score},Probability:{Probability}");
            // Check for null before reading Length to avoid a NullReferenceException.
            if (Features != null)
            {
                Console.WriteLine($"TextFeatures Length:{Features.Length}");
                foreach (var f in Features)
                {
                    Console.Write($"{f},");
                }
                Console.WriteLine();
            }
            Console.WriteLine();
        }
    }

By studying the debug output, you can see how the whole data processing pipeline works.
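Before dumping every row, it can also help to list the columns that a transform adds to the data view. The sketch below assumes the Microsoft.ML NuGet package is referenced; the SegmentedRow class, the SchemaSketch class, and the two in-memory sample rows are made up for this example and stand in for the already segmented output of the jieba step.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type holding text that has already been segmented.
public class SegmentedRow
{
    public bool Label { get; set; }
    public string JiebaText { get; set; }
}

public static class SchemaSketch
{
    // Return the column names of a data view in schema order.
    public static List<string> ColumnNames(IDataView data)
    {
        return data.Schema.Select(column => column.Name).ToList();
    }

    // Featurize a handful of in-memory rows so the schema can be inspected.
    public static IDataView Featurize(MLContext mlContext, IEnumerable<SegmentedRow> rows)
    {
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);
        var featurizer = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: "JiebaText");
        return featurizer.Fit(data).Transform(data);
    }

    public static void Main()
    {
        var mlContext = new MLContext();
        var rows = new[]
        {
            new SegmentedRow { Label = false, JiebaText = "支委会 。" },
            new SegmentedRow { Label = true, JiebaText = "开展 党员 答题 活动 。" }
        };

        IDataView transformed = Featurize(mlContext, rows);
        foreach (var column in transformed.Schema)
        {
            // Print each column's name and type, including the added Features vector.
            Console.WriteLine($"{column.Name}: {column.Type}");
        }
    }
}
```

The same ColumnNames helper can be pointed at the predictions data view from this article to confirm that JiebaText, Features, PredictedLabel, Score and Probability are all present.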

5. Getting the Resources

Source code download: https://github.com/seabluescn/study_ml.net

Project name: BinaryClassification_TextFeaturize
