I have applied K-Means on a problem where I have random dateset of age and weight, after applying K means, I still cannot understand that why data points are not converging to centroids.
age = [20,21,22,20,21,22,23,27,28,26,23,28,29,33,28,25,24,28,29,30,33,31,40,22,28,29,27,25,29,23]
weight = [80,50,70,50,82,49,60,66,76,45,66,86,77,56,76,81,72,92,72,91,72,75,60,61,99,54,56,45,40,76]
df_main = pd.DataFrame({'age':age_array,'weight':weight_array})
X_std = StandardScaler().fit_transform(df)
# Run local implementation of kmeans
km = KMeans(n_clusters=2, max_iter=100)
km.fit(X_std)
centroids = km.cluster_centers_
# Plot the clustered data
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_std[km.labels_ == 0, 0], X_std[km.labels_ == 0, 1],
c='green', label='cluster 1')
plt.scatter(X_std[km.labels_ == 1, 0], X_std[km.labels_ == 1, 1],
c='blue', label='cluster 2')
plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=50,
c='red', label='centroid')
plt.legend()
plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.xlabel('Age')
plt.ylabel('Height')
plt.title('Visualization of clustered data', fontweight='bold')
CodePudding user response:
I think you don't fully understand k-means.
What actually is happening that every iteration the centroids are recalculated to minimize the distance between them and the data points. (in the first iteration they are selected randomly or by other initialization).
When the process is ended we recalculate for each data point the distance from it to the centroids and the min distance will determine which set this data point is part of. Therefore the centroids will usually not converge to any data point, actually if that happens it is a sign that your hyperparams can be optimized (decrease k for instance).