
Python Machine Learning: Recognizing the Hardest Digits in Images. CAPTCHAs? CAPTCHAs Are Child's Play!

September 18, 2018 | 移动技术网 IT编程


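Let's look at some summary statistics for the digits dataset.

from sklearn.datasets import load_digits
digits = load_digits()
# There are 1797 samples and 1797 labels in total
print('Image data shape: ', digits.data.shape)
print('Label data shape: ', digits.target.shape)

Output

 Image data shape: (1797, 64)
 Label data shape: (1797,)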

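1.2 Displaying images and their labels

The data has 1797 rows with 64 features each, so every sample is a flat array of 64 values. But an image is a two-dimensional structure, and we know the images in the digits dataset are square, so we reshape each sample's raw data into an (8, 8) array.

To give a more intuitive feel for the dataset, we display the first 5 images of the digits dataset here.

# First, look at the raw data of one image
print(digits.data[0])
# Reshape the image data into an (8, 8) array
import numpy as np
print(np.reshape(digits.data[0], (8, 8)))

Output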

 [ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5. 0. 0. 3.
 15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8. 8. 0. 0. 5. 8. 0.
 0. 9. 8. 0. 0. 4. 11. 0. 1. 12. 7. 0. 0. 2. 14. 5. 10. 12.
 0. 0. 0. 0. 6. 13. 10. 0. 0. 0.]
 [[ 0. 0. 5. 13. 9. 1. 0. 0.]
 [ 0. 0. 13. 15. 10. 15. 5. 0.]
 [ 0. 3. 15. 2. 0. 11. 8. 0.]
 [ 0. 4. 12. 0. 0. 8. 8. 0.]
 [ 0. 5. 8. 0. 0. 9. 8. 0.]
 [ 0. 4. 11. 0. 1. 12. 7. 0.]
 [ 0. 2. 14. 5. 10. 12. 0. 0.]
 [ 0. 0. 6. 13. 10. 0. 0. 0.]]

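Display matplotlib figures inline in the notebook:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Take the first 5 samples of the dataset
images = digits.data[0:5]
labels = digits.target[0:5]
# Figure size: 20 wide, 4 high
plt.figure(figsize=(20, 4))
for idx, (imagedata, label) in enumerate(zip(images, labels)):
    # The canvas is split into one row of 5 subplots; idx+1 selects the (idx+1)-th subplot
    plt.subplot(1, 5, idx + 1)
    image = np.reshape(imagedata, (8, 8))
    # Show the image in grayscale for easier viewing
    plt.imshow(image, cmap=plt.cm.gray)
    plt.title('the number of image is {}'.format(label))

[Figure: the first 5 images of the digits dataset, titled with their labels]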

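1.3 Splitting the data into training and test sets

To check that the model generalizes to new data rather than merely fitting the data it was trained on, we split the digits dataset into a training set and a test set. The model is trained on the former and evaluated on the latter.

from sklearn.model_selection import train_test_split
# The test set is 30% of the data; fixing random_state makes this random split reproducible
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3, random_state=100)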
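
As a quick sanity check on the split, the sketch below reloads the dataset and prints the resulting shapes. It assumes the data comes from scikit-learn's `load_digits` (the shapes reported earlier match it). With 1797 samples and `test_size=0.3`, 540 samples go to the test set and 1257 stay in the training set.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Assumption: the article's `digits` object is the result of load_digits()
digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=100)

# Features and labels must stay aligned row by row after the split
print('train:', x_train.shape, y_train.shape)
print('test: ', x_test.shape, y_test.shape)
```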

1.4 Training, prediction, and accuracy

In this article we use LogisticRegression. Because the digits dataset is small, the default solver is good enough.

from sklearn.linear_model import LogisticRegression
logisticregre = LogisticRegression()
# Train
logisticregre.fit(x_train, y_train)

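Now predict on new data. Note that to predict a single sample (a one-dimensional array), you must first convert that array into matrix form (a two-dimensional array).

data.reshape(n_rows, n_columns)

This converts data into a matrix of shape (n_rows, n_columns). If you do not know the size of one of the dimensions, you can set it to -1 and NumPy will infer it.

# The first sample in the test set.
# We know it is a single row; if we did not know the number of columns, we would pass -1
# In fact we know there are 64 columns,
# so the line below is equivalent to x_test[0].reshape(1, 64)
one_new_image = x_test[0].reshape(1, -1)
# Predict
logisticregre.predict(one_new_image)

Output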

array([9])
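
The `-1` placeholder can be tried in isolation. The following is a minimal sketch using a made-up 64-element array (`flat` is a stand-in, not real image data) to show how NumPy infers the missing dimension.

```python
import numpy as np

# Stand-in for one flattened 8x8 image (values are arbitrary)
flat = np.arange(64, dtype=float)

as_row = flat.reshape(1, -1)    # one row; NumPy infers 64 columns
as_image = flat.reshape(8, -1)  # eight rows; NumPy infers 8 columns

print(flat.shape, as_row.shape, as_image.shape)
```

The `(1, 64)` form is what `predict` expects for a single sample, while the `(8, 8)` form is what `imshow` needs to draw the image.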

Predicting several samples at once

predictions = logisticregre.predict(x_test[0:10])
# The true digits
print(y_test[0:10])
# The predicted digits
print(predictions)
# Accuracy on the whole test set
score = logisticregre.score(x_test, y_test)
print(score)

Output

 [9 9 0 2 4 5 7 4 7 2]
 [9 3 0 2 4 5 7 4 3 2]
 0.9592592592592593

Wow, that is quite accurate.
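
For a classifier, `score` returns the mean accuracy, that is, the fraction of samples whose prediction equals the true label. As a small illustration, the sketch below recomputes that fraction by hand for just the ten samples printed above (so the result differs from the score on the full test set). The array names are illustrative.

```python
import numpy as np

# The ten true labels and predictions printed above
y_true = np.array([9, 9, 0, 2, 4, 5, 7, 4, 7, 2])
y_pred = np.array([9, 3, 0, 2, 4, 5, 7, 4, 3, 2])

# Elementwise comparison gives booleans; their mean is the accuracy
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 8 of the 10 predictions match
```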

1.5 Confusion matrix

A confusion matrix is commonly used to evaluate prediction accuracy. Here we draw one with seaborn and matplotlib.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
predictions = logisticregre.predict(x_test)
# Rows are the actual labels, columns are the predicted labels
cm = confusion_matrix(y_test, predictions)
plt.figure(figsize=(9, 9))
sns.heatmap(cm, annot=True, fmt='.3f', linewidths=0.5, square=True, cmap='Blues_r')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title('Accuracy score: {}'.format(score), size=15)

[Figure: heatmap of the confusion matrix for the digits test set]
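
To make the layout of the matrix concrete, here is a minimal sketch that builds one by hand with NumPy for the ten labels printed in section 1.4. It follows the same convention as `confusion_matrix` (rows are actual labels, columns are predicted labels); the variable names are illustrative.

```python
import numpy as np

y_true = np.array([9, 9, 0, 2, 4, 5, 7, 4, 7, 2])
y_pred = np.array([9, 3, 0, 2, 4, 5, 7, 4, 3, 2])

n_classes = 10
cm = np.zeros((n_classes, n_classes), dtype=int)
for actual, predicted in zip(y_true, y_pred):
    # One count per sample, in cell (actual label, predicted label)
    cm[actual, predicted] += 1

# Correct predictions sit on the diagonal; everything else is a mistake
print('correct:', np.trace(cm))
print('9 predicted as 3:', cm[9, 3])
print('7 predicted as 3:', cm[7, 3])
```

Off-diagonal cells show which digits get confused with which, something a single accuracy number cannot tell you.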

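2. The MNIST dataset

The digits dataset is tiny: the training and prediction above take only a few seconds. With a large dataset, however, training speed becomes a pressing concern and tuning the model's parameters becomes necessary. So let's try the much larger MNIST dataset. I downloaded MNIST from the web and organized it into CSV files, in which the first column is the label and the remaining columns are the pixel values, 785 columns in total. Each MNIST image is 28*28 pixels.

import pandas as pd
import numpy as np
train = pd.read_csv('mnist_train.csv', header=None)
test = pd.read_csv('mnist_test.csv', header=None)
y_train = train.loc[:, 0]  # pd.Series
# Note: train.loc[:, 1:] returns a pd.DataFrame.
# We convert it to an np.array here, which is easier to work with
x_train = np.array(train.loc[:, 1:])
y_test = test.loc[:, 0]
x_test = np.array(test.loc[:, 1:])
# Check the shapes of the MNIST arrays
print('x_train shape: {}'.format(x_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('x_test shape: {}'.format(x_test.shape))
print('y_test shape: {}'.format(y_test.shape))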

Output

 x_train shape: (60000, 784)
 y_train shape: (60000,)
 x_test shape: (10000, 784)
 y_test shape: (10000,)
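
The CSV layout described above (label first, pixels after) can be illustrated without the real files. The sketch below parses two made-up rows from an in-memory string; the pixel values are invented and only 4 pixel columns are used instead of 784, purely to keep the example short.

```python
import io
import pandas as pd

# Two fake rows: the first field is the label, the rest are pixel values
csv_text = "5,0,128,255,0\n0,0,0,64,32\n"
frame = pd.read_csv(io.StringIO(csv_text), header=None)

labels = frame.loc[:, 0].to_numpy()   # first column: the digit
pixels = frame.loc[:, 1:].to_numpy()  # remaining columns: the image

print(labels)        # one label per row
print(pixels.shape)  # (number of rows, number of pixel columns)
```

With `header=None`, pandas names the columns 0, 1, 2, ..., which is why `loc[:, 0]` and `loc[:, 1:]` select the label and the pixels respectively.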

2.1 Displaying MNIST images and labels

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Look at only 5 images
images = x_train[0:5]
labels = y_train[0:5]
plt.figure(figsize=(20, 4))
for idx, (imagedata, label) in enumerate(zip(images, labels)):
    plt.subplot(1, 5, idx + 1)
    # MNIST images are 28*28 pixels
    image = np.reshape(imagedata, (28, 28))
    plt.imshow(image, cmap=plt.cm.gray)
    plt.title('the number of image is {}'.format(label))

[Figure: the first 5 MNIST training images, titled with their labels]

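2.2 Training, prediction, and accuracy

The digits dataset had only 1797 samples, each an (8, 8) image. MNIST has 70000 samples, each a (28, 28) image. Without a sensible choice of parameters, training will therefore be very slow.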

from sklearn.linear_model import LogisticRegression
import time
def model(solver='liblinear'):
    """
    Fit LogisticRegression with the given solver and report
    the test accuracy and the elapsed time in seconds.
    """
    start = time.time()
    logisticregr = LogisticRegression(solver=solver)
    logisticregr.fit(x_train, y_train)
    score = logisticregr.score(x_test, y_test)
    end = time.time()
    print('Accuracy: {0}, elapsed: {1}'.format(score, int(end - start)))
    return logisticregr
model(solver='liblinear')
model(solver='lbfgs')

Output

 Accuracy: 0.9176, elapsed: 3840
 Accuracy: 0.9173, elapsed: 65

In my tests on a 2015 MacBook Air:

The default solver='liblinear' took 3840 seconds to train.

solver='lbfgs' took 65 seconds to train.
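
The two runs can be compared with a couple of lines of arithmetic. This is only a sketch: the variable names are illustrative, and the numbers are copied from the output above rather than measured again.

```python
# Timings (seconds) and accuracies copied from the output above
liblinear_seconds, lbfgs_seconds = 3840, 65
liblinear_accuracy, lbfgs_accuracy = 0.9176, 0.9173

speedup = liblinear_seconds / lbfgs_seconds
accuracy_drop = liblinear_accuracy - lbfgs_accuracy

print('speedup: {:.1f}x, accuracy drop: {:.4f}'.format(speedup, accuracy_drop))
# speedup: 59.1x, accuracy drop: 0.0003
```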

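Switching the solver from liblinear to lbfgs costs only 0.0003 in accuracy, while training runs nearly 60 times faster. In machine learning, different algorithm parameters can lead to very different training speeds; see the figure below.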

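2.3 Displaying misclassified images

For the digits dataset we used a confusion matrix to inspect accuracy, but that is not very intuitive. Here we display the images that were predicted incorrectly.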

logisticregr = model(solver='lbfgs')
predictions = logisticregr.predict(x_test)
# Indexes of the misclassified images
misclassifiedindexes = []
for idx, (label, predict) in enumerate(zip(y_test, predictions)):
    if label != predict:
        misclassifiedindexes.append(idx)
print(misclassifiedindexes)

Output

 Accuracy: 0.9173, elapsed: 76
 [8, 33, 38, 63, 66, 73, 119, 124, 149, 151, 153, 193, 211, 217, 218, 233, 241, 245, 247, 259, 282, 290, 307, 313, 318, 320, 
 ........ 
 857, 877, 881, 898, 924, 938, 939, 947, 16789808, 9811, 9832, 9835, 9839, 9840, 9855, 9858, 9867, 9874, 9883, 9888, 9892, 9893, 9901, 9905, 9916, 9925, 9926, 9941, 9943, 9944, 9959, 9970, 9975, 9980, 9982, 9986]

Display the misclassified images:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(20, 4))
# Show the first 5 misclassified images
for plotidx, badidx in enumerate(misclassifiedindexes[0:5]):
    plt.subplot(1, 5, plotidx + 1)
    img = np.reshape(x_test[badidx], (28, 28))
    plt.imshow(img)
    predict_label = predictions[badidx]
    true_label = y_test[badidx]
    plt.title('predicted: {0}, actual: {1}'.format(predict_label, true_label))
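
The index-collecting loop can also be written as a single vectorized NumPy expression. The sketch below is an alternative, not the approach used above; it runs on the ten labels printed in section 1.4 so that it is self-contained, and the names are illustrative.

```python
import numpy as np

y_true = np.array([9, 9, 0, 2, 4, 5, 7, 4, 7, 2])
y_pred = np.array([9, 3, 0, 2, 4, 5, 7, 4, 3, 2])

# Positions where the prediction differs from the true label
misclassified = np.flatnonzero(y_true != y_pred)
print(misclassified)  # [1 8]
```

For the MNIST arrays the same idea would read `np.flatnonzero(np.asarray(y_test) != predictions)`; the `np.asarray` call converts the pandas Series to a plain array before comparing.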

I won't be sharing the code separately!
