
Python Machine Learning: A Tutorial on yellowbrick, a Visualization Library for Data Exploration

August 20, 2019 | 移动技术网 IT编程

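Background

When I was learning sklearn, getting over the hurdle of the algorithms was only part of it; I also had to learn matplotlib for visualization. For my practical work visualization matters even more, yet matplotlib's usability and aesthetics leave a lot to be desired. I went through plotly and seaborn in turn and finally settled on bokeh, because it integrates cleanly with flask, which makes building data dashboards much easier.

A while ago I noticed that this library (yellowbrick) makes data exploration fairly convenient, and today I finally had time to study it. I started from the English documentation, then found that someone is already working on a Chinese localization. It reads a lot like Google Translate output, but in the spirit of taking what is available and saving some effort, I half copied and half learned from it, and still found a few places that do not match the documentation.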

# http://www.scikit-yb.org/zh/latest/tutorial.html

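Model Selection Tutorial

In this tutorial we look at the scores of a variety of scikit-learn models and compare them using yellowbrick's visual diagnostic tools, in order to select the best model for our data.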

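The Model Selection Triple

Discussions of machine learning are frequently centered on model selection. Whether it is logistic regression, random forests, Bayesian methods, or artificial neural networks, machine learning practitioners are often quick to express their preference. The reason is mostly historical. Although modern third-party machine learning libraries have made deploying many kinds of models almost trivial, traditionally the application and tuning of even one of these algorithms required many years of study. As a result, practitioners tend to have strong preferences for particular (and most likely familiar) models over others.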

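However, model selection is more nuanced than simply picking the "right" or "wrong" algorithm. In practice, the workflow includes:

selecting and/or engineering the smallest and most predictive feature set,
choosing a set of algorithms from a model family, and
tuning the algorithm hyperparameters to optimize performance.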
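
To make the three-part workflow above concrete, here is a minimal sketch that enumerates candidate (features, algorithm, hyperparameters) combinations. The candidate lists are hypothetical and chosen only for illustration; they are not taken from the original tutorial. Even with a handful of options per element the number of combinations multiplies quickly.

```python
from itertools import product

# Hypothetical candidates for each element of the triple (illustrative only).
feature_sets = [
    ("cap-shape",),
    ("cap-shape", "cap-surface"),
    ("cap-shape", "cap-surface", "cap-color"),
]
algorithms = ["LinearSVC", "KNeighborsClassifier", "RandomForestClassifier"]
hyperparameters = [{"setting": "default"}, {"setting": "tuned"}]

# Every (features, algorithm, hyperparameters) combination is one candidate.
triples = list(product(feature_sets, algorithms, hyperparameters))

print(len(triples))  # 3 * 3 * 2 = 18
for features, algorithm, params in triples[:3]:
    print(features, algorithm, params)
```

In a real search each element would have far more options, which is why an exhaustive sweep becomes impractical and guided exploration is attractive.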

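The model selection triple was first described in a 2015 SIGMOD paper by Kumar et al. Their paper concerns the development of next-generation database systems built for predictive modeling. The authors cogently argue that such systems are badly needed because machine learning is highly experimental in practice. "Model selection," they explain, "is iterative and exploratory because the space of (model selection triples) is usually infinite, and it is generally impossible for analysts to know a priori which (combination) will yield satisfactory accuracy and/or insights."

Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can home in on quality models more effectively than exhaustive search. By visualizing the model selection process, data scientists can steer towards final, explainable models and avoid pitfalls.

The yellowbrick library is a visual diagnostic platform for machine learning that allows data scientists to steer the model selection process. yellowbrick extends the scikit-learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the scikit-learn pipeline process, providing visual diagnostics throughout the transformation of high-dimensional data.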
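
The idea of an object that wraps an estimator and adds a drawing step to the usual fit/score calls can be sketched in plain Python. This is a toy stand-in, not yellowbrick's implementation: `MajorityClassifier` and `ToyVisualizer` are hypothetical names, and the "plot" is just a text bar.

```python
class MajorityClassifier:
    """Toy estimator: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyVisualizer:
    """Hypothetical wrapper: same fit/score calls, plus a text 'drawing'."""

    def __init__(self, estimator):
        self.estimator = estimator
        self.accuracy_ = None

    def fit(self, X, y):
        # Delegate fitting to the wrapped estimator.
        self.estimator.fit(X, y)
        return self

    def score(self, X, y):
        # Record the score so that show() can draw it later.
        predicted = self.estimator.predict(X)
        self.accuracy_ = sum(p == t for p, t in zip(predicted, y)) / len(y)
        return self.accuracy_

    def show(self):
        bar = "#" * round(self.accuracy_ * 20)
        print(f"accuracy {self.accuracy_:.2f} |{bar:<20}|")


X = [[0], [1], [2], [3]]
y = ["e", "e", "e", "p"]
viz = ToyVisualizer(MajorityClassifier()).fit(X, y)
viz.score(X, y)
viz.show()
```

The point of the pattern is that the wrapper can be dropped in wherever the estimator would go, so the diagnostic comes along with the normal modeling calls.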

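About the Data

This tutorial uses a modified version of the mushroom dataset from the UCI Machine Learning Repository. Our objective is to predict whether a mushroom is poisonous or edible based on its characteristics.

The data include descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each species was identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended (this last class was combined with the poisonous one).

Our file, "agaricus-lepiota.txt", contains information for 3 nominally valued attributes and a target value for 8124 instances of mushrooms (4208 edible, 3916 poisonous).

Let's load the data with pandas.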

import os
import pandas as pd
mushrooms = 'data/shrooms.csv'  # the dataset
dataset   = pd.read_csv(mushrooms)
# dataset.columns = names
dataset.head()

	id	class	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	...	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat	unnamed: 24
0	1	p	x	s	n	t	p	f	c	n	...	w	w	p	w	o	p	k	s	u	nan
1	2	e	x	s	y	t	a	f	c	b	...	w	w	p	w	o	p	n	n	g	nan
2	3	e	b	s	w	t	l	f	c	b	...	w	w	p	w	o	p	n	n	m	nan
3	4	p	x	y	w	t	p	f	c	n	...	w	w	p	w	o	p	k	s	u	nan
4	5	e	x	s	g	f	n	f	w	b	...	w	w	p	w	o	e	n	a	g	nan

5 rows × 25 columns

features = ['cap-shape', 'cap-surface', 'cap-color']
target   = ['class']
x = dataset[features]
y = dataset[target]

dataset.shape # two fewer mushrooms than in the official documentation

(8122, 25)

dataset.groupby('class').count() # one fewer mushroom in each class

	id	cap-shape	cap-surface	cap-color	bruises	odor	gill-attachment	gill-spacing	gill-size	gill-color	...	stalk-color-above-ring	stalk-color-below-ring	veil-type	veil-color	ring-number	ring-type	spore-print-color	population	habitat	unnamed: 24
class
e	4207	4207	4207	4207	4207	4207	4207	4207	4207	4207	...	4207	4207	4207	4207	4207	4207	4207	4207	4207	0
p	3915	3915	3915	3915	3915	3915	3915	3915	3915	3915	...	3915	3915	3915	3915	3915	3915	3915	3915	3915	0

2 rows × 24 columns

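Feature Extraction

Our data, including the target, is categorical. To use machine learning we need to convert these values into numeric ones. To do this we use scikit-learn transformers to turn the input dataset into something a model can be fit to. Luckily, scikit-learn provides a transformer for converting categorical labels into integers: sklearn.preprocessing.LabelEncoder. Unfortunately, it can only transform a single vector at a time, so we have to adapt it in order to apply it to multiple columns.
One question I have: isn't the mushroom class just a single vector?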
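
The target is indeed a single vector, but the features span several columns, and each column needs its own mapping. A minimal pure-Python sketch of the idea follows; it mimics the sorted-value-to-integer mapping that a label encoder produces, and the helper names `fit_label_encoder` and `encode_columns` are hypothetical. The rows are the first three rows of the table above, restricted to the three feature columns.

```python
def fit_label_encoder(values):
    """Map each distinct value to an integer, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode_columns(rows, columns):
    """Fit one mapping per column, then encode every row with them."""
    mappings = {col: fit_label_encoder(row[col] for row in rows) for col in columns}
    encoded = [{col: mappings[col][row[col]] for col in columns} for row in rows]
    return encoded, mappings


rows = [
    {"cap-shape": "x", "cap-surface": "s", "cap-color": "n"},
    {"cap-shape": "x", "cap-surface": "s", "cap-color": "y"},
    {"cap-shape": "b", "cap-surface": "s", "cap-color": "w"},
]
encoded, mappings = encode_columns(rows, ["cap-shape", "cap-surface", "cap-color"])
print(mappings["cap-shape"])  # {'b': 0, 'x': 1}
print(encoded[0])             # {'cap-shape': 1, 'cap-surface': 0, 'cap-color': 0}
```

Because the same letter can mean different things in different columns, one shared mapping would be wrong; keeping a separate encoder per column is the design choice the class below also makes.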

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
class EncodeCategorical(BaseEstimator, TransformerMixin):
    """
    Encodes a specified list of columns or all columns if None.
    """

    def __init__(self, columns=None):
        # Keep the argument as given; None means "encode every column"
        self.columns  = columns
        self.encoders = None

    def fit(self, data, target=None):
        """
        Expects a data frame with named columns to encode.
        """
        # Encode all columns if columns is None
        columns = data.columns if self.columns is None else self.columns

        # Fit a label encoder for each column in the data frame
        self.encoders = {
            column: LabelEncoder().fit(data[column])
            for column in columns
        }
        return self

    def transform(self, data):
        """
        Uses the encoders to transform a data frame.
        """
        output = data.copy()
        for column, encoder in self.encoders.items():
            output[column] = encoder.transform(data[column])

        return output

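Modeling and Evaluation

Common Metrics for Evaluating Classifiers

Precision is the number of correct positive results divided by the number of all results predicted positive (e.g. how many of the mushrooms we predicted to be edible actually were?).

Recall is the number of correct positive results divided by the number of positive results that should have been returned (e.g. how many of the mushrooms that were poisonous did we correctly predict to be poisonous?).

The F1 score is a measure of a test's accuracy. It considers both the precision and the recall of the test to compute the score. The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.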
precision = true positives / (true positives + false positives)

recall = true positives / (false negatives + true positives)

f1 score = 2 * ((precision * recall) / (precision + recall))
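
A small worked example of these three formulas, written as a plain Python function. The function name `precision_recall_f1` and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, round(f1, 4))  # 0.75 0.6 0.6667
```

Note that F1 (0.6667) sits below the arithmetic mean of 0.75 and 0.6 (0.675): the harmonic mean is pulled towards the weaker of the two values.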
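Now we are ready to make some predictions!

Let's build a way to evaluate multiple estimators, first using traditional numeric scores (which we will later compare to some visual diagnostics from the yellowbrick library).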

from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
def model_selection(x, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
         ('label_encoding', EncodeCategorical(x.keys())),
         ('one_hot_encoder', OneHotEncoder(categories='auto')),  # set categories explicitly, otherwise a warning is raised
         ('estimator', estimator)
    ])

    # Fit the whole pipeline (both encoders plus the estimator)
    model.fit(x, y)

    expected  = y
    predicted = model.predict(x)

    # Compute and return the F1 score (the harmonic mean of precision and recall)
    return (f1_score(expected, predicted))
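
The second pipeline step exists because integer codes suggest an ordering (2 > 1 > 0) that categories such as cap colors do not have. A minimal sketch of what one-hot encoding does to a single label-encoded column is shown here; `one_hot` is a hypothetical helper and the codes are made up for illustration.

```python
def one_hot(codes, n_categories):
    """Turn integer codes into rows of 0/1 indicators."""
    return [[1 if code == k else 0 for k in range(n_categories)] for code in codes]


# Hypothetical integer codes for one label-encoded column with 3 categories.
codes = [0, 2, 1, 2]
for code, row in zip(codes, one_hot(codes, 3)):
    print(code, row)
```

Each category gets its own indicator column, so a linear model can assign it an independent weight instead of treating the code as a magnitude.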

from sklearn.svm import LinearSVC, NuSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier

model_selection(x, y, LinearSVC())

0.6582119537920643

import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")  # suppress warnings

model_selection(x, y, NuSVC())

0.6878837238441299

model_selection(x, y, SVC())

0.6625145971195017

model_selection(x, y, SGDClassifier())

0.5738408700629649

model_selection(x, y, KNeighborsClassifier())

0.6856846473029046

model_selection(x, y, LogisticRegressionCV())

0.6582119537920643

model_selection(x, y, LogisticRegression())

0.6578749058025622

model_selection(x, y, BaggingClassifier())

0.6873901878632248

model_selection(x, y, ExtraTreesClassifier())

0.6872294372294372

model_selection(x, y, RandomForestClassifier())

0.6992081007399714

Preliminary Model Evaluation

Based on the F1 scores above, which model performs best?
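
One way to answer is simply to rank the reported numbers. The sketch below copies the ten scores from the runs above and, as an added assumption, compares them with a trivial baseline that labels every mushroom as the positive class (poisonous, which a sorted label encoding maps to 1), using the class counts from the groupby output. Keep in mind that these scores were computed on the same data the models were fit on.

```python
# F1 scores exactly as reported in the runs above.
scores = {
    "LinearSVC": 0.6582119537920643,
    "NuSVC": 0.6878837238441299,
    "SVC": 0.6625145971195017,
    "SGDClassifier": 0.5738408700629649,
    "KNeighborsClassifier": 0.6856846473029046,
    "LogisticRegressionCV": 0.6582119537920643,
    "LogisticRegression": 0.6578749058025622,
    "BaggingClassifier": 0.6873901878632248,
    "ExtraTreesClassifier": 0.6872294372294372,
    "RandomForestClassifier": 0.6992081007399714,
}

ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:24s} {score:.4f}")

# Assumed baseline: predict "poisonous" for every row.
# Then TP = 3915, FP = 4207, FN = 0, so F1 = 2*TP / (2*TP + FP + FN).
poisonous, edible = 3915, 4207
baseline_f1 = 2 * poisonous / (2 * poisonous + edible)
print(f"all-poisonous baseline   {baseline_f1:.4f}")
```

RandomForestClassifier ranks first and SGDClassifier last, but under this assumption most of the scores are only slightly above the baseline of roughly 0.65, which suggests the three cap features carry limited signal on their own.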

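Visual Model Evaluation

Now let's refactor the model evaluation function to use yellowbrick's ClassificationReport class, a model visualizer that displays precision, recall, and F1 scores. This visual model analysis tool integrates numeric scores with a color-coded heatmap to support easy interpretation and detection, especially of the nuances of Type I and Type II errors, which are very relevant (lifesaving, even!) to our use case.

A Type I error (or "false positive") is detecting an effect that is not present (e.g. deciding that a mushroom is poisonous when it is in fact edible).

A Type II error (or "false negative") is failing to detect an effect that is present (e.g. believing a mushroom is edible when it is in fact poisonous).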
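
The two error types can be counted directly from expected and predicted labels. This is a minimal sketch with made-up labels, treating poisonous as the positive class (1) and edible as 0; `error_counts` is a hypothetical helper.

```python
def error_counts(expected, predicted, positive=1):
    """Count false positives (Type I) and false negatives (Type II)."""
    false_positives = sum(
        1 for e, p in zip(expected, predicted) if p == positive and e != positive
    )
    false_negatives = sum(
        1 for e, p in zip(expected, predicted) if p != positive and e == positive
    )
    return false_positives, false_negatives


# Hypothetical labels: 1 = poisonous (positive class), 0 = edible.
expected  = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 0]
type_1, type_2 = error_counts(expected, predicted)
print(type_1, type_2)  # 1 2
```

For this use case the two counts are not equally costly: a Type I error wastes an edible mushroom, while a Type II error means eating a poisonous one.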

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from yellowbrick.classifier import ClassificationReport


def visual_model_selection(x, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
         ('label_encoding', EncodeCategorical(x.keys())),
         ('one_hot_encoder', OneHotEncoder()),
         ('estimator', estimator)
    ])

    # Instantiate the classification model and visualizer
    visualizer = ClassificationReport(model, classes=['edible', 'poisonous'])
    visualizer.fit(x, y)
    visualizer.score(x, y)
    visualizer.poof()  # renamed to show() in yellowbrick 1.0 and later

visual_model_selection(x, y, LinearSVC())

[Figure: ClassificationReport heatmap for LinearSVC]

# visualizations for the other classifiers omitted
visual_model_selection(x, y, RandomForestClassifier())

[Figure: ClassificationReport heatmap for RandomForestClassifier]

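Reflection

Which model seems best now? Why?
Which one is most likely to save your life?
How does the experience of visual model evaluation differ from numeric model evaluation?

Precision, recall, and the combined F1-measure

The F1-score takes both precision and recall into account.
Visualization is simply more intuitive. Running away now~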

About the Author

yeayee on Zhihu; 5 years of Python; works with flask + mongodb + sklearn + bokeh
