k-Means Clustering
Goal
This post introduces k-means clustering by generating artificial 2D data from three Gaussian clusters and then recovering those clusters with scikit-learn's KMeans.
Libraries
In [1]:
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Create artificial clustered data
Parameters
In [2]:
# 3 clusters in 2D: a mean vector and a 2x2 covariance matrix per cluster
d_means = {'cluster 1': [0, 0],
           'cluster 2': [4, 5],
           'cluster 3': [5, 0]}
d_covs = {'cluster 1': [[1, 1],
                        [1, 4]],
          'cluster 2': [[1, 1],
                        [1, 3]],
          'cluster 3': [[4, 2],
                        [2, 2]]}
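A covariance matrix passed to `np.random.multivariate_normal` should be symmetric and positive semi-definite. The following is a minimal sketch of a sanity check for the matrices above. The helper name `is_valid_covariance` is made up for illustration, and it tests the stricter positive-definite condition.

```python
import numpy as np

# Same covariance matrices as in the parameters above
d_covs = {'cluster 1': [[1, 1], [1, 4]],
          'cluster 2': [[1, 1], [1, 3]],
          'cluster 3': [[4, 2], [2, 2]]}


def is_valid_covariance(cov):
    """Return True if cov is symmetric and positive definite (hypothetical helper)."""
    cov = np.asarray(cov, dtype=float)
    symmetric = np.allclose(cov, cov.T)
    # eigvalsh assumes a symmetric matrix and returns real eigenvalues
    return symmetric and bool(np.all(np.linalg.eigvalsh(cov) > 0))


for name, cov in d_covs.items():
    print(name, is_valid_covariance(cov))
```

All three matrices pass this check, so each one describes a proper, non-degenerate 2D Gaussian.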
Ground Truth Data
In [3]:
# Generate 1000 points per cluster from the parameters above
frames = []
for cluster, mean in d_means.items():
    x, y = np.random.multivariate_normal(mean, d_covs[cluster], 1000).T
    frames.append(pd.DataFrame({'x': x, 'y': y}))
    plt.plot(x, y, '.', label=cluster, alpha=0.5)
# DataFrame.append was removed in pandas 2.0, so concatenate once instead;
# ignore_index=True gives each row a unique index
df_data = pd.concat(frames, ignore_index=True)
# plt.axis('equal')
plt.legend()
plt.show()
Observed Data
In [4]:
df_data.head()
In [5]:
df_data.plot(kind='scatter', x='x', y='y', s=0.5);
K-means clustering
In [6]:
cls = KMeans(init='k-means++', n_clusters=3, n_init=10)
cls.fit(df_data)
In [7]:
df_data['predicted_label'] = cls.labels_.astype(int)
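`KMeans.fit` hides the iteration that produces these labels. As a rough illustration, the sketch below implements the plain Lloyd iteration in NumPy: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is not scikit-learn's implementation. It assumes random initial centroids instead of k-means++ and a single run instead of `n_init=10`. The function name `lloyd_kmeans` and the toy data are made up for the example.

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iteration with random initial centroids (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centroid, shape (n, k)
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable, so the iteration has converged
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy blobs, made up for illustration
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                 rng.normal(10, 0.5, size=(50, 2))])
labels, centroids = lloyd_kmeans(toy, k=2)
print(centroids)
```

A single random start can settle in a poor local optimum on harder data, which is presumably why the call above uses `init='k-means++'` together with `n_init=10`.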
In [8]:
df_data.head(5)
Check the predicted labels with a plot
In [9]:
sns.scatterplot(x="x", y="y", hue="predicted_label",
                data=df_data, palette='Paired', s=20);
plt.legend(loc='lower right');
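Because the data is artificial, the generating cluster of every point is known, so the visual check can be backed by a number. The sketch below is one possible way to do that, not part of the original analysis: it regenerates the data with a fixed seed (an assumption added for reproducibility), keeps the true labels, and scores the k-means result with the adjusted Rand index from `sklearn.metrics`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility

# Same means and covariances as in the parameters section
means = [[0, 0], [4, 5], [5, 0]]
covs = [[[1, 1], [1, 4]], [[1, 1], [1, 3]], [[4, 2], [2, 2]]]

# Draw 1000 points per cluster and remember which cluster produced each one
X = np.vstack([rng.multivariate_normal(m, c, 1000) for m, c in zip(means, covs)])
true_label = np.repeat([0, 1, 2], 1000)

pred_label = KMeans(init='k-means++', n_clusters=3, n_init=10,
                    random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 for identical partitions, about 0 for random labelling;
# it ignores how the cluster ids happen to be numbered
ari = adjusted_rand_score(true_label, pred_label)
print(f"Adjusted Rand index: {ari:.3f}")
```

The score should fall below 1.0 here because the three Gaussians overlap, and k-means tends to favour compact, roughly spherical clusters while these are elongated.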