ML.NET Machine Learning Framework Study Notes [9]: Automated Learning

June 11, 2019 | 移动技术网 IT编程

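I. Overview

In this article we first implement a wine quality prediction program with a regression algorithm, then implement it again with AutoML, and learn how to use AutoML by comparing the two implementations.

The data is the UCI Wine Quality Dataset from the competition site kaggle.com, available at: https://www.kaggle.com/c/uci-wine-quality-dataset/data

The inputs are chemical measurements of each wine, such as alcohol content, and the output is a taster's score. The fields are described as follows: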

Data fields
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

Other:
13 - id (unique id for each sample, needed for submission)

   

II. Code

using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace regression_winequality
{
    public class WineData
    {
        [LoadColumn(0)]
        public float FixedAcidity;

        [LoadColumn(1)]
        public float VolatileAcidity;

        [LoadColumn(2)]
        public float CitricAcid;

        [LoadColumn(3)]
        public float ResidualSugar;

        [LoadColumn(4)]
        public float Chlorides;

        [LoadColumn(5)]
        public float FreeSulfurDioxide;

        [LoadColumn(6)]
        public float TotalSulfurDioxide;

        [LoadColumn(7)]
        public float Density;

        [LoadColumn(8)]
        public float PH;

        [LoadColumn(9)]
        public float Sulphates;

        [LoadColumn(10)]
        public float Alcohol;

        // The taster's score is the value we want to predict.
        [LoadColumn(11)]
        [ColumnName("Label")]
        public float Quality;

        [LoadColumn(12)]
        public float Id;
    }

    public class WinePrediction
    {
        [ColumnName("Score")]
        public float PredictionQuality;
    }

    class Program
    {
        static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "mlmodel", "model.zip");

        static void Main(string[] args)
        {
            Train();
            Prediction();

            Console.WriteLine("hit any key to finish the app");
            Console.ReadKey();
        }

        public static void Train()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Prepare the data: load the CSV and hold out 20% for testing
            string trainDataPath = Path.Combine(Environment.CurrentDirectory, "data", "winequality-data-full.csv");
            var fullData = mlContext.Data.LoadFromTextFile<WineData>(path: trainDataPath, separatorChar: ',', hasHeader: true);

            var trainTestData = mlContext.Data.TrainTestSplit(fullData, testFraction: 0.2);
            var trainData = trainTestData.TrainSet;
            var testData = trainTestData.TestSet;

            // Build the learning pipeline and fit the model on the training data
            var dataProcessPipeline = mlContext.Transforms.DropColumns("Id")
                .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.FreeSulfurDioxide)))
                .Append(mlContext.Transforms.NormalizeMeanVariance(nameof(WineData.TotalSulfurDioxide)))
                .Append(mlContext.Transforms.Concatenate("Features", new string[] {
                    nameof(WineData.FixedAcidity),
                    nameof(WineData.VolatileAcidity),
                    nameof(WineData.CitricAcid),
                    nameof(WineData.ResidualSugar),
                    nameof(WineData.Chlorides),
                    nameof(WineData.FreeSulfurDioxide),
                    nameof(WineData.TotalSulfurDioxide),
                    nameof(WineData.Density),
                    nameof(WineData.PH),
                    nameof(WineData.Sulphates),
                    nameof(WineData.Alcohol) }));

            var trainer = mlContext.Regression.Trainers.LbfgsPoissonRegression(labelColumnName: "Label", featureColumnName: "Features");
            var trainingPipeline = dataProcessPipeline.Append(trainer);
            var trainedModel = trainingPipeline.Fit(trainData);

            // Evaluate on the held-out test data
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
            PrintRegressionMetrics(trainer.ToString(), metrics);

            // Save the model (create the target folder first so the save cannot fail on a missing directory)
            Console.WriteLine("====== save model to local file =========");
            Directory.CreateDirectory(Path.GetDirectoryName(ModelFilePath));
            mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);
        }

        static void Prediction()
        {
            MLContext mlContext = new MLContext(seed: 1);

            ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
            var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

            WineData wineData = new WineData
            {
                FixedAcidity = 7.6f,
                VolatileAcidity = 0.33f,
                CitricAcid = 0.36f,
                ResidualSugar = 2.1f,
                Chlorides = 0.034f,
                FreeSulfurDioxide = 26f,
                TotalSulfurDioxide = 172f,
                Density = 0.9944f,
                PH = 3.42f,
                Sulphates = 0.48f,
                Alcohol = 10.5f
            };

            var wineQuality = predictor.Predict(wineData);
            Console.WriteLine($"wine data  quality is:{wineQuality.PredictionQuality} ");
        }

        // This console helper is called above but is not part of the original listing;
        // the version here is a minimal stand-in that prints the main regression metrics.
        static void PrintRegressionMetrics(string name, RegressionMetrics metrics)
        {
            Console.WriteLine($"metrics for {name} regression model");
            Console.WriteLine($"r2 score:      {metrics.RSquared:0.##}");
            Console.WriteLine($"absolute loss: {metrics.MeanAbsoluteError:#.##}");
            Console.WriteLine($"squared loss:  {metrics.MeanSquaredError:#.##}");
            Console.WriteLine($"rms loss:      {metrics.RootMeanSquaredError:#.##}");
        }
    }
}

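The Poisson regression algorithm was already covered in the earlier article on rating facial attractiveness. This program introduces no new concepts, so I won't explain it again; it serves mainly as a baseline for comparison with the AutoML code below.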

 

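III. Automated Learning

Machine learning workflows all look broadly alike: prepare the data, define the features, choose an algorithm, train, and so on. Two questions come up again and again: which algorithm should we choose, and how should its parameters be configured? Automated learning addresses exactly this. The framework repeats the cycle of data selection, algorithm selection, parameter tuning, and evaluation many times, and picks the model that evaluates best.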
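To make the "try candidates, evaluate, keep the best" idea concrete, here is a deliberately tiny sketch of such a search loop. It is not how ML.NET implements AutoML. The class AutoSearchSketch, the Trainer delegate, and the two toy candidates (a mean baseline and a least-squares line) are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AutoSearchSketch
{
    // "Training" here means: take (x, y) training pairs and return a predictor.
    public delegate Func<double, double> Trainer(double[] x, double[] y);

    // Candidate 1: always predict the mean of the training labels.
    public static Func<double, double> TrainMeanBaseline(double[] x, double[] y)
    {
        double mean = y.Average();
        return _ => mean;
    }

    // Candidate 2: ordinary least-squares line y = slope * x + intercept.
    public static Func<double, double> TrainLeastSquaresLine(double[] x, double[] y)
    {
        double meanX = x.Average(), meanY = y.Average();
        double covariance = x.Zip(y, (xi, yi) => (xi - meanX) * (yi - meanY)).Sum();
        double variance = x.Sum(xi => (xi - meanX) * (xi - meanX));
        double slope = covariance / variance;
        double intercept = meanY - slope * meanX;
        return xi => slope * xi + intercept;
    }

    public static double MeanSquaredError(Func<double, double> model, double[] x, double[] y)
    {
        return x.Zip(y, (xi, yi) => (model(xi) - yi) * (model(xi) - yi)).Average();
    }

    // Train every candidate, score it on held-out validation data, keep the best one.
    public static KeyValuePair<string, double> PickBest(
        IDictionary<string, Trainer> candidates,
        double[] trainX, double[] trainY, double[] validX, double[] validY)
    {
        return candidates
            .Select(c => new KeyValuePair<string, double>(
                c.Key, MeanSquaredError(c.Value(trainX, trainY), validX, validY)))
            .OrderBy(result => result.Value)
            .First();
    }

    public static void Main()
    {
        var candidates = new Dictionary<string, Trainer>
        {
            ["MeanBaseline"] = TrainMeanBaseline,
            ["LeastSquaresLine"] = TrainLeastSquaresLine,
        };
        double[] trainX = { 1, 2, 3, 4 }, trainY = { 3, 5, 7, 9 };   // generated from y = 2x + 1
        double[] validX = { 5, 6 }, validY = { 11, 13 };

        var best = PickBest(candidates, trainX, trainY, validX, validY);
        Console.WriteLine($"best: {best.Key}, validation MSE: {best.Value:F4}");
    }
}
```

A real AutoML run differs mainly in scale: many more trainers, a hyperparameter search inside each trainer, and a time budget that decides when the loop stops.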

The full code is as follows:

using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

namespace regression_winequality
{
    public class WineData
    {
        [LoadColumn(0)]
        public float FixedAcidity;

        [LoadColumn(1)]
        public float VolatileAcidity;

        [LoadColumn(2)]
        public float CitricAcid;

        [LoadColumn(3)]
        public float ResidualSugar;

        [LoadColumn(4)]
        public float Chlorides;

        [LoadColumn(5)]
        public float FreeSulfurDioxide;

        [LoadColumn(6)]
        public float TotalSulfurDioxide;

        [LoadColumn(7)]
        public float Density;

        [LoadColumn(8)]
        public float PH;

        [LoadColumn(9)]
        public float Sulphates;

        [LoadColumn(10)]
        public float Alcohol;

        // The taster's score is the value we want to predict.
        [LoadColumn(11)]
        [ColumnName("Label")]
        public float Quality;

        [LoadColumn(12)]
        public float Id;
    }

    public class WinePrediction
    {
        [ColumnName("Score")]
        public float PredictionQuality;
    }

    // RegressionExperimentProgressHandler and Debugger are helper classes from the
    // sample project; they are discussed in the code analysis section.
    class Program
    {
        static readonly string ModelFilePath = Path.Combine(Environment.CurrentDirectory, "mlmodel", "model.zip");
        static readonly string TrainDataPath = Path.Combine(Environment.CurrentDirectory, "data", "winequality-data-train.csv");
        static readonly string TestDataPath = Path.Combine(Environment.CurrentDirectory, "data", "winequality-data-test.csv");

        static void Main(string[] args)
        {
            TrainAndSave();
            LoadAndPrediction();

            Console.WriteLine("hit any key to finish the app");
            Console.ReadKey();
        }

        public static void TrainAndSave()
        {
            MLContext mlContext = new MLContext(seed: 1);

            // Prepare the data
            var trainData = mlContext.Data.LoadFromTextFile<WineData>(path: TrainDataPath, separatorChar: ',', hasHeader: true);
            var testData = mlContext.Data.LoadFromTextFile<WineData>(path: TestDataPath, separatorChar: ',', hasHeader: true);

            var progressHandler = new RegressionExperimentProgressHandler();
            uint experimentTime = 200;   // time budget for the experiment, in seconds

            // Let AutoML try different trainers and settings within the time budget
            ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
               .CreateRegressionExperiment(experimentTime)
               .Execute(trainData, "Label", progressHandler: progressHandler);

            Debugger.PrintTopModels(experimentResult);

            RunDetail<RegressionMetrics> best = experimentResult.BestRun;
            ITransformer trainedModel = best.Model;

            // Evaluate BestRun on the test data
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
            Debugger.PrintRegressionMetrics(best.TrainerName, metrics);

            // Save the model (create the target folder first so the save cannot fail on a missing directory)
            Console.WriteLine("====== save model to local file =========");
            Directory.CreateDirectory(Path.GetDirectoryName(ModelFilePath));
            mlContext.Model.Save(trainedModel, trainData.Schema, ModelFilePath);
        }

        static void LoadAndPrediction()
        {
            MLContext mlContext = new MLContext(seed: 1);

            ITransformer loadedModel = mlContext.Model.Load(ModelFilePath, out var modelInputSchema);
            var predictor = mlContext.Model.CreatePredictionEngine<WineData, WinePrediction>(loadedModel);

            WineData wineData = new WineData
            {
                FixedAcidity = 7.6f,
                VolatileAcidity = 0.33f,
                CitricAcid = 0.36f,
                ResidualSugar = 2.1f,
                Chlorides = 0.034f,
                FreeSulfurDioxide = 26f,
                TotalSulfurDioxide = 172f,
                Density = 0.9944f,
                PH = 3.42f,
                Sulphates = 0.48f,
                Alcohol = 10.5f
            };

            var wineQuality = predictor.Predict(wineData);
            Console.WriteLine($"wine data  quality is:{wineQuality.PredictionQuality} ");
        }
    }
}

  

IV. Code Analysis

1. The automated learning process

            var progressHandler = new RegressionExperimentProgressHandler();
            uint experimentTime = 200;

            ExperimentResult<RegressionMetrics> experimentResult = mlContext.Auto()
               .CreateRegressionExperiment(experimentTime)
               .Execute(trainData, "Label", progressHandler: progressHandler);

            Debugger.PrintTopModels(experimentResult); // print the data for all models

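experimentTime is the time allowed for the experiment, in seconds. progressHandler is a reporter: each time one training run completes, the framework calls its Report method once.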

    // Requires: using System; using Microsoft.ML.AutoML; using Microsoft.ML.Data;
    public class RegressionExperimentProgressHandler : IProgress<RunDetail<RegressionMetrics>>
    {
        private int _iterationIndex;

        // Called by AutoML once for every completed run.
        public void Report(RunDetail<RegressionMetrics> iterationResult)
        {
            _iterationIndex++;
            Console.WriteLine($"report index:{_iterationIndex},trainername:{iterationResult.TrainerName},runtimeinseconds:{iterationResult.RuntimeInSeconds}");
        }
    }

The output is as follows:

report index:1,trainername:SdcaRegression,runtimeinseconds:12.5244426
report index:2,trainername:LightGbmRegression,runtimeinseconds:11.2034988
report index:3,trainername:FastTreeRegression,runtimeinseconds:14.810409
report index:4,trainername:FastTreeTweedieRegression,runtimeinseconds:14.7338553
report index:5,trainername:FastForestRegression,runtimeinseconds:15.6224459
report index:6,trainername:LbfgsPoissonRegression,runtimeinseconds:11.1668197
report index:7,trainername:OnlineGradientDescentRegression,runtimeinseconds:10.5353
report index:8,trainername:OlsRegression,runtimeinseconds:10.8905459
report index:9,trainername:LightGbmRegression,runtimeinseconds:10.5703296
report index:10,trainername:FastTreeRegression,runtimeinseconds:19.4470509
report index:11,trainername:FastTreeTweedieRegression,runtimeinseconds:63.638882
report index:12,trainername:LightGbmRegression,runtimeinseconds:10.7710518

When the experiment finishes, we print the data for all models with Debugger.PrintTopModels:

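    // Requires: using System; using System.Linq; using Microsoft.ML.AutoML; using Microsoft.ML.Data;
    public class Debugger
    {
        private const int Width = 114;

        public static void PrintTopModels(ExperimentResult<RegressionMetrics> experimentResult)
        {
            // Keep only runs with usable validation metrics, best R-squared first.
            var topRuns = experimentResult.RunDetails
                .Where(r => r.ValidationMetrics != null && !double.IsNaN(r.ValidationMetrics.RSquared))
                .OrderByDescending(r => r.ValidationMetrics.RSquared)
                .ToList();

            Console.WriteLine("top models ranked by r-squared --");
            PrintRegressionMetricsHeader();
            for (var i = 0; i < topRuns.Count; i++)
            {
                var run = topRuns[i];
                PrintIterationMetrics(i + 1, run.TrainerName, run.ValidationMetrics, run.RuntimeInSeconds);
            }
        }

        public static void PrintRegressionMetricsHeader()
        {
            CreateRow($"{"",-4} {"trainer",-35} {"rsquared",8} {"absolute-loss",13} {"squared-loss",12} {"rms-loss",8} {"duration",9}", Width);
        }

        public static void PrintIterationMetrics(int iteration, string trainerName, RegressionMetrics metrics, double? runtimeInSeconds)
        {
            CreateRow($"{iteration,-4} {trainerName,-35} {metrics?.RSquared ?? double.NaN,8:F4} {metrics?.MeanAbsoluteError ?? double.NaN,13:F2} {metrics?.MeanSquaredError ?? double.NaN,12:F2} {metrics?.RootMeanSquaredError ?? double.NaN,8:F2} {runtimeInSeconds.Value,9:F1}", Width);
        }

        // Not shown in the original listing: reconstructed from the evaluation output
        // printed in this article, so treat the exact format as an assumption.
        public static void PrintRegressionMetrics(string name, RegressionMetrics metrics)
        {
            Console.WriteLine("*************************************************");
            Console.WriteLine($"*       metrics for {name} regression model");
            Console.WriteLine("*------------------------------------------------");
            Console.WriteLine($"*       lossfn:        {metrics.LossFunction:0.##}");
            Console.WriteLine($"*       r2 score:      {metrics.RSquared:0.##}");
            Console.WriteLine($"*       absolute loss: {metrics.MeanAbsoluteError:#.##}");
            Console.WriteLine($"*       squared loss:  {metrics.MeanSquaredError:#.##}");
            Console.WriteLine($"*       rms loss:      {metrics.RootMeanSquaredError:#.##}");
            Console.WriteLine("*************************************************");
        }

        // Pads the message to a fixed width and frames it with '|' on both sides.
        public static void CreateRow(string message, int width)
        {
            Console.WriteLine("|" + message.PadRight(width - 2) + "|");
        }
    }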

CreateRow only handles the layout. The output is as follows:

top models ranked by r-squared --
|     trainer                             rsquared absolute-loss squared-loss rms-loss  duration                 |
|1    FastTreeTweedieRegression             0.4731          0.46         0.41     0.64      63.6                 |
|2    FastTreeTweedieRegression             0.4431          0.49         0.43     0.65      14.7                 |
|3    FastTreeRegression                    0.4386          0.54         0.49     0.70      19.4                 |
|4    LightGbmRegression                    0.4177          0.52         0.45     0.67      10.8                 |
|5    FastTreeRegression                    0.4102          0.51         0.45     0.67      14.8                 |
|6    LightGbmRegression                    0.3944          0.52         0.46     0.68      11.2                 |
|7    LightGbmRegression                    0.3501          0.60         0.57     0.75      10.6                 |
|8    FastForestRegression                  0.3381          0.60         0.58     0.76      15.6                 |
|9    OlsRegression                         0.2829          0.56         0.53     0.73      10.9                 |
|10   LbfgsPoissonRegression                0.2760          0.62         0.63     0.80      11.2                 |
|11   SdcaRegression                        0.2746          0.58         0.56     0.75      12.5                 |
|12   OnlineGradientDescentRegression       0.0593          0.69         0.81     0.90      10.5                 |

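The results show that some algorithms were tried more than once, but each run of the same algorithm used different configuration parameters, such as thresholds and tree depth.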
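Since the tree-based trainers dominate the ranking, one natural follow-up is to spend the whole time budget on them. The fragment below is a sketch of how that could look with RegressionExperimentSettings from Microsoft.ML.AutoML. The class name ExperimentSettingsSketch and the choice of trainers are assumptions of mine and are not part of the original project, and the fragment needs the Microsoft.ML.AutoML NuGet package to compile.

```csharp
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public static class ExperimentSettingsSketch
{
    // Runs a regression experiment restricted to a hand-picked set of trainers.
    public static ExperimentResult<RegressionMetrics> Run(MLContext mlContext, IDataView trainData)
    {
        var settings = new RegressionExperimentSettings
        {
            MaxExperimentTimeInSeconds = 200,              // same budget as in the article
            OptimizingMetric = RegressionMetric.RSquared   // metric used to pick the best run
        };

        // Replace the default trainer list with the tree-based trainers only.
        settings.Trainers.Clear();
        settings.Trainers.Add(RegressionTrainer.FastTree);
        settings.Trainers.Add(RegressionTrainer.FastTreeTweedie);
        settings.Trainers.Add(RegressionTrainer.LightGbm);

        return mlContext.Auto()
            .CreateRegressionExperiment(settings)
            .Execute(trainData, labelColumnName: "Label");
    }
}
```

Whether narrowing the search actually improves the best score depends on the data, so it is worth comparing the result against a run with the default trainer list.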

 

2. Getting the best model

            RunDetail<RegressionMetrics> best = experimentResult.BestRun;
            ITransformer trainedModel = best.Model;

Once we have the best model, evaluating and saving it works the same way as in the earlier code. Evaluating it on the test data gives:

*************************************************
*       metrics for FastTreeTweedieRegression regression model
*------------------------------------------------
*       lossfn:        0.67
*       r2 score:      0.34
*       absolute loss: .63
*       squared loss:  .67
*       rms loss:      .82
*************************************************

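On the test data the R² score is only 0.34 and the RMS loss is 0.82 points, so this result cannot be used in production. The problem is probably that we have not found the key features that determine wine quality.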
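To read the metrics in the report above, it helps to see how they are computed. The sketch below uses the standard textbook definitions of absolute loss (MAE), squared loss (MSE), RMS loss (RMSE), and R². That ML.NET computes them in exactly this way is my assumption. The class name RegressionMetricsSketch and the sample scores are made up for illustration.

```csharp
using System;
using System.Linq;

public static class RegressionMetricsSketch
{
    // Mean absolute error: average size of the prediction error.
    public static double MeanAbsoluteError(double[] actual, double[] predicted)
    {
        return actual.Zip(predicted, (a, p) => Math.Abs(a - p)).Average();
    }

    // Mean squared error: average squared prediction error.
    public static double MeanSquaredError(double[] actual, double[] predicted)
    {
        return actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average();
    }

    // Root mean squared error: square root of MSE, in the same unit as the label.
    public static double RootMeanSquaredError(double[] actual, double[] predicted)
    {
        return Math.Sqrt(MeanSquaredError(actual, predicted));
    }

    // R-squared = 1 - SS_res / SS_tot: the share of label variance the model explains.
    public static double RSquared(double[] actual, double[] predicted)
    {
        double mean = actual.Average();
        double residualSum = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        double totalSum = actual.Sum(a => (a - mean) * (a - mean));
        return 1.0 - residualSum / totalSum;
    }

    public static void Main()
    {
        // Hypothetical quality scores and predictions, not taken from the dataset.
        double[] actual = { 5, 6, 7, 5, 6 };
        double[] predicted = { 5.5, 5.5, 6.5, 5.5, 6.0 };

        Console.WriteLine($"absolute loss: {MeanAbsoluteError(actual, predicted):F2}");
        Console.WriteLine($"squared loss:  {MeanSquaredError(actual, predicted):F2}");
        Console.WriteLine($"rms loss:      {RootMeanSquaredError(actual, predicted):F2}");
        Console.WriteLine($"r2 score:      {RSquared(actual, predicted):F2}");
    }
}
```

Under these definitions an R² of 1 means perfect predictions and an R² of 0 means the model does no better than always predicting the average score, which is why a value around 0.34 counts as weak.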

 

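V. Summary

This article concludes the ML.NET study notes series. The original code used along the way comes mainly from: https://github.com/dotnet/machinelearning-samples .

That repository contains examples of other algorithms as well, including clustering, matrix factorization, and anomaly detection. Their overall flow is much the same, so with this series as a foundation, interested readers can explore them on their own.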

  

VI. Resources

Source code: https://github.com/seabluescn/study_ml.net

Regression project name: regression_winequality

AutoML project name: regression_winequality_automl


 
