Machine Learning and scikit-learn Basics: Study Notes
Qifei · 2023-12-02 · via 🛫Qifei's Blog

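1. Data Analysis and Visualization

Before choosing a model, it usually helps to visualize the data and look for visible relationships, for example whether two variables are linearly related. The quick plotting recipes below make such relationships easy to spot.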

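1.1. Single Variable

1.1.1. Box Plot

import matplotlib.pyplot as plt

# dataset is a pandas DataFrame; layout=(2,2) has room for up to four columns
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

The figure below shows the resulting box plots. In each plot the green line marks the median (not the mean), the box spans the 25th to 75th percentiles, the whiskers reach the most extreme points within 1.5 times the interquartile range of the box, and circles mark outliers beyond the whiskers.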

机器学习和scikit-learn库基础学习笔记_001.png
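
The quantities drawn in a box plot can also be computed directly, which is handy for checking what the circles actually flag. The following is a minimal sketch, assuming a small made-up Series named `values` (it is not the dataset from this post) and the conventional 1.5 × IQR whisker rule that matplotlib applies by default:

```python
import pandas as pd

# Hypothetical sample data; 40 is deliberately far from the rest.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])

q1 = values.quantile(0.25)   # lower edge of the box
median = values.median()     # line drawn inside the box
q3 = values.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                # interquartile range (height of the box)

# Points farther than 1.5 * IQR from the box are drawn as outlier circles.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("outliers:", outliers.tolist())
```

Running the same calculation on a real column shows which rows a box plot would mark as outliers before deciding whether to drop or keep them.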

1.1.2. Histogram

...
# histograms, one per numeric column
dataset.hist()
plt.show()

The resulting histograms:

机器学习和scikit-learn库基础学习笔记_002.png

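1.2. Multiple Variables

1.2.1. Pairwise Distributions Between Variables

import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
...
# scatter plot matrix: one scatter plot for every pair of columns
scatter_matrix(dataset)
plt.show()

The resulting scatter plot matrix: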

机器学习和scikit-learn库基础学习笔记_003.png

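1.2.2. Variable Correlation

Display the correlation matrix and a heatmap of it. The larger the absolute value of a coefficient, the stronger the linear correlation; the sign gives the direction.

import matplotlib.pyplot as plt
import seaborn as sns
# dataset is a DataFrame; numeric_only=True (pandas >= 1.5) skips non-numeric columns
correlation = dataset.corr(numeric_only=True)
# display(correlation)
plt.figure(figsize=(14, 12))
# vmin/vmax pin the color scale to the full range of a correlation coefficient
heatmap = sns.heatmap(correlation, annot=True, linewidths=0, vmin=-1, vmax=1, cmap="RdBu_r")
plt.show()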

机器学习和scikit-learn库基础学习笔记_004.png
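
To make the numbers in the heatmap less of a black box, the sketch below computes a Pearson coefficient by hand and compares it with `DataFrame.corr()`, which uses Pearson by default. The DataFrame `df` and the helper `pearson` are made up for illustration and are not part of the original post:

```python
import numpy as np
import pandas as pd

# Hypothetical data: y rises with x, z falls exactly linearly with x.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
    "z": [5.0, 4.0, 3.0, 2.0, 1.0],
})

def pearson(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

manual_xy = pearson(df["x"], df["y"])   # close to +1: strong positive relation
manual_xz = pearson(df["x"], df["z"])   # exactly linear and decreasing
corr_matrix = df.corr()                 # Pearson by default

print(corr_matrix.round(3))
print("manual x/y:", round(manual_xy, 3), "manual x/z:", round(manual_xz, 3))
```

A coefficient near +1 or -1 only indicates a strong linear relation; a value near 0 does not rule out a nonlinear one, which is why the scatter plots above remain useful.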

1.2.3. Comparing Two Variables

# Visualize the correlation between pH and fixed acidity

# Create a new dataframe containing only the pH and fixed acidity columns
fixedAcidity_pH = dataset[['pH', 'fixed acidity']]

# Initialize a joint grid with the dataframe (the `size` argument was renamed to `height` in seaborn 0.9)
gridA = sns.JointGrid(x="fixed acidity", y="pH", data=fixedAcidity_pH, height=6)

# Draw a regression plot in the center of the grid
gridA = gridA.plot_joint(sns.regplot, scatter_kws={"s": 10})

# Draw the marginal distributions in the same grid (histplot replaces the deprecated distplot)
gridA = gridA.plot_marginals(sns.histplot, kde=True)
plt.show()

机器学习和scikit-learn库基础学习笔记_005.png

2. Model Selection

scikit-learn provides many models. On a given dataset, different models produce different results after training.

Commonly used models and how to import them:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
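
One common way to choose among these models is to score each of them with cross-validation on the same data. The sketch below is an illustration rather than the post's own procedure: it assumes the iris dataset bundled with scikit-learn as a stand-in, and the short names in `candidates`, the 5-fold setting, and the `max_iter` value are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Candidate models with mostly default hyperparameters.
candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "SVM": SVC(),
}

# Every model is evaluated on the same stratified folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

best_name = max(mean_accuracy, key=mean_accuracy.get)
print("best by mean accuracy:", best_name)
```

Using the same folds for every model keeps the comparison fair; differences smaller than the printed standard deviation are usually not meaningful.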

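3. Model Training

Once a model has been chosen, it needs to be trained. First split the dataset into a training set and a validation set, then fit the model on the training set so that its accuracy can be measured on the validation set.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
...
# Extract X and y from the dataset: X holds the features (first 4 columns), y holds the class labels (column 4)
X_array = dataset.values[:,:4]
y_array = dataset.values[:,4]
# Hold out 20% of the rows for validation and train on the remaining 80% (a 1:4 split)
X_train,X_test,y_train,y_test = train_test_split(X_array,y_array,test_size=0.2,shuffle=True)
model = SVC()
model.fit(X_train,y_train)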
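
Because the split above is shuffled without a seed, the reported accuracy changes from run to run. The following self-contained sketch shows one way to make it repeatable. It assumes the bundled iris dataset in place of `dataset`, and the `random_state=42` and `stratify=y` arguments are additions that the original code does not use.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in dataset (assumption): 4 feature columns, 3 balanced classes.
X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps class proportions equal
# in the training and validation sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)

model = SVC()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
test_class_counts = np.bincount(y_test)

print("train shape:", X_train.shape, "test shape:", X_test.shape)
print("validation samples per class:", test_class_counts.tolist())
print(f"validation accuracy: {accuracy:.3f}")
```

Stratifying matters most on small or imbalanced datasets, where a purely random 20% sample can under-represent a class.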

4. Scoring, Prediction, and Saving the Model

import joblib

# mean accuracy on the validation set
print(model.score(X_test,y_test))
# predicted class label for each validation sample
prediction = model.predict(X_test)

# save the model to disk
filename = 'finalized_model.sav'
joblib.dump(model, filename)

# some time later...

# load the model from disk
loaded_model = joblib.load(filename)
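
A quick sanity check after reloading is to confirm that the restored model predicts exactly what the original did. The sketch below is a self-contained round trip under stated assumptions: it uses the bundled iris dataset, writes to a temporary directory rather than the working directory, and uses `accuracy_score` to show that a classifier's `score` is its plain accuracy.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in dataset and split (assumptions, not the post's data).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = SVC().fit(X_train, y_train)

# For classifiers, score() is the accuracy of predict() on the given data.
prediction = model.predict(X_test)
score_via_method = model.score(X_test, y_test)
score_via_metric = accuracy_score(y_test, prediction)

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "finalized_model.sav")
    joblib.dump(model, path)
    loaded_model = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(prediction, loaded_model.predict(X_test))
print("accuracy:", score_via_method, "round trip identical:", same_predictions)
```

Saved models are pickles under the hood, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.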

5. References

Your First Machine Learning Project in Python Step-By-Step

How to Use Data Science to Understand What Makes Wine Taste Good

Save and Load Machine Learning Models in Python with scikit-learn
