
Cluster analysis of the iris dataset with sklearn

July 16, 2020 | 移动技术网IT编程

Importing libraries

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

%matplotlib inline
sns.set(style="white")
pd.set_option("display.max_rows", 1000)

The iris dataset bundled with sklearn (150 rows)

4 predictor variables
3-class outcome

iris = load_iris()
X = iris["data"]
Y = iris["target"]

display(X[:5])
display(pd.Series(Y).value_counts())

Y = Y.reshape(-1, 1)  # reshape Y to (150, 1) so it can be concatenated with X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

2    50
1    50
0    50
dtype: int64

data = pd.DataFrame(np.concatenate((X, Y), axis=1),
                    columns=["x1", "x2", "x3", "x4", "y"])
data["y"] = data["y"].astype("int64")
data.head()

	x1	x2	x3	x4
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

Inspecting the data distribution

Pairwise scatter plots of the 4 predictor variables

sns.pairplot(data, hue="y")

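Feature scaling

The features should be scaled to a common range before K-means clustering. K-means relies on Euclidean distances, so a feature with a larger numeric range would otherwise dominate the result. Here each feature is rescaled to [0, 1] with MinMaxScaler.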

scaler = MinMaxScaler()
data.iloc[:, :4] = scaler.fit_transform(data.iloc[:, :4])
data.head()

	x1	x2	x3	x4
0	0.222222	0.625000	0.067797	0.041667
1	0.166667	0.416667	0.067797	0.041667
2	0.111111	0.500000	0.050847	0.041667
3	0.083333	0.458333	0.084746	0.041667
4	0.194444	0.666667	0.067797	0.041667
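The scaled values shown above follow the min-max formula (x - min) / (max - min), applied column by column. The following is a minimal sketch, assuming only NumPy and scikit-learn, that checks the hand-written formula against MinMaxScaler; the names `X_manual` and `X_sklearn` are illustrative and do not appear in the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris()["data"]

# Manual min-max scaling: (x - column min) / (column max - column min)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
print(X_manual[0].round(6))
```

The first printed row can be compared with row 0 of the scaled table above.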

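Set the number of clusters to 3 and run K-means clustering

clus = KMeans(n_clusters=3)
# Note: iloc[:, 1:4] selects x2, x3 and x4 only; x1 is not used in the fit
clus = clus.fit(data.iloc[:, 1:4])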

Cluster labels of the 150 samples after fitting

clus.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
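The label ids printed above are arbitrary: K-means starts from random centers, so a different run may number the same clusters differently. A minimal sketch of making the labels repeatable follows. The `random_state=42` and `n_init=10` arguments are assumptions added for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

# Scaled x2, x3, x4, matching the iloc[:, 1:4] slice used in the article
X = MinMaxScaler().fit_transform(load_iris()["data"])[:, 1:4]

# Two independent fits with the same seed (assumed value) give the same labels
km_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
km_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(np.array_equal(km_a.labels_, km_b.labels_))
print(np.bincount(km_a.labels_))  # number of samples per cluster id
```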

The three cluster centers after fitting (columns are the scaled x2, x3, x4)

clus.cluster_centers_

array([[0.595     , 0.07830508, 0.06083333],
       [0.2975    , 0.55661017, 0.50583333],
       [0.42916667, 0.76745763, 0.8075    ]])

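Evaluation metric for the clustering

inertia_ is the sum of squared distances from each sample to its nearest cluster center.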

clus.inertia_

4.481991774793322
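To make the meaning of `inertia_` concrete, it can be recomputed by hand from the fitted labels and centers. This is a sketch under stated assumptions: it refits its own model with `random_state=0` and `n_init=10` (neither is in the original notebook), so the printed number may not match the value above exactly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

# Scaled x2, x3, x4, matching the iloc[:, 1:4] slice used in the article
X = MinMaxScaler().fit_transform(load_iris()["data"])[:, 1:4]

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the center assigned to each sample, then sum the squared distances
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```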

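Determining the best number of clusters

Try different numbers of clusters and record the criterion (inertia) for each. For a fixed k a smaller value means tighter clusters, but the value tends to shrink as k grows, so draw an "elbow plot" and look for the k where the decrease levels off.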

L = []
for i in range(1, 9):
    clus = KMeans(n_clusters=i)
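    # Note: iloc[:, 1:3] selects only x2 and x3, unlike the 1:4 slice used elsewhere,
    # so the k=3 criterion below (3.31) differs from the inertia of 4.48 reported above
    clus.fit(data.iloc[:, 1:3])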
    L.append([i, clus.inertia_])
L = pd.DataFrame(L, columns=["k", "criterion"])
L

	k	criterion
0	1	18.253249
1	2	5.106290
2	3	3.312646
3	4	2.585065
4	5	1.946648
5	6	1.637264
6	7	1.387541
7	8	1.175937

sns.pointplot(x="k", y="criterion", data=L)
sns.despine()

[Figure: elbow plot of criterion against k]

Predicting cluster labels with the chosen model

The elbow plot suggests that 3 or 4 clusters is a good choice; 3 is used here
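One way to read the elbow numerically is to look at how much the criterion drops, relative to the previous k. The sketch below uses the criterion values copied from the table above; the column name `rel_drop` is made up for illustration.

```python
import pandas as pd

# Criterion (inertia) values copied from the table above
elbow = pd.DataFrame({
    "k": [1, 2, 3, 4, 5, 6, 7, 8],
    "criterion": [18.253249, 5.106290, 3.312646, 2.585065,
                  1.946648, 1.637264, 1.387541, 1.175937],
})

# Fractional decrease relative to the previous k (NaN for k=1)
elbow["rel_drop"] = -elbow["criterion"].pct_change()
print(elbow.round(3))
```

The relative drop is about 72% going to k=2, about 35% going to k=3, and about 22% going to k=4, after which it stays in the 15% to 25% range. That pattern is consistent with choosing 3 or 4 clusters.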

clus = KMeans(n_clusters=3)
clus = clus.fit(data.iloc[:, 1:4])
data["pred"] = clus.predict(data.iloc[:, 1:4])

# Relabel the clusters so they line up with the true classes: 0 -> 1, 1 -> 0, 2 -> 2.
# This mapping depends on the label ids K-means happened to assign in this run.
data["Pred"] = data["pred"].map({0: 1, 1: 0, 2: 2})
data["Pred"] = data["Pred"].astype("int64")
data.head()

	x1	x2	x3	x4	pred
0	0.222222	0.625000	0.067797	0.041667	1
1	0.166667	0.416667	0.067797	0.041667	1
2	0.111111	0.500000	0.050847	0.041667	1
3	0.083333	0.458333	0.084746	0.041667	1
4	0.194444	0.666667	0.067797	0.041667	1
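Because the cluster ids change from run to run, a hand-written relabeling has to be rechecked each time. A possible alternative, sketched here and not the author's method, is to map each cluster to the true class that occurs most often in it (a majority vote). The seed, `n_init`, and the names `clusters`, `counts`, `mapping` are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
# Scaled x2, x3, x4, matching the iloc[:, 1:4] slice used in the article
X = MinMaxScaler().fit_transform(iris["data"])[:, 1:4]
y = pd.Series(iris["target"], name="y")

clusters = pd.Series(
    KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    name="cluster",
)

# Rows: cluster ids; columns: true classes
counts = pd.crosstab(clusters, y)

# Map each cluster id to the class that occurs most often in that cluster
mapping = counts.idxmax(axis=1).to_dict()
pred = clusters.map(mapping)

print(mapping)
print((pred == y).mean())
```

A majority vote can assign two clusters to the same class when the clustering is poor; a one-to-one assignment would need a matching step instead.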

Building the confusion matrix and computing the accuracy

df = pd.crosstab(data["y"], data["Pred"])
df

Pred	0	1	2
y
0	50	0	0
1	0	46	4
2	0	4	46

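# Collect the off-diagonal (misclassified) counts of the confusion matrix
L = []
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        if i != j:
            L.append(df.iloc[i, j])
# Accuracy = (total samples - misclassified samples) / total samples
print("Accuracy:", round((150 - sum(L)) / 150 * 100, 1), "%")

Accuracy: 94.7 %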
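The same figure can be obtained from the diagonal of the confusion matrix: correct predictions sit on the diagonal, so accuracy is the trace divided by the total count. A minimal sketch, using the counts copied from the crosstab above:

```python
import numpy as np

# Confusion matrix copied from the crosstab above (rows: y, columns: Pred)
cm = np.array([[50, 0, 0],
               [0, 46, 4],
               [0, 4, 46]])

# Correct predictions are on the diagonal: 50 + 46 + 46 = 142 out of 150
accuracy = np.trace(cm) / cm.sum()
print(f"Accuracy: {accuracy * 100:.1f} %")  # prints "Accuracy: 94.7 %"
```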

Original article: https://blog.csdn.net/weixin_40575651/article/details/107334269
