K-Means algorithm: centroids are not placed in the clusters

I have a problem. I want to cluster my dataset. Unfortunately my centroids are not in the clusters but outside. I have already read

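from sklearn import metrics

# model is presumably the fitted KMeans estimator, X the feature matrix it was fitted on
inertia = model.inertia_  # sum of squared distances of samples to their closest center
sil = metrics.silhouette_score(X, model.labels_)  # mean silhouette coefficient, from -1 to 1

print(f'inertia {inertia:.3f}')
print(f'silhouette {sil:.3f}')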

[OUT]

inertia 4490.076
silhouette 0.156

CodePudding user response：

The answer to your main question: the cluster centers are not outside of your clusters.

1 : You are clustering over the 14 features listed in features_clustering.

2 : You are viewing the clusters in a two-dimensional plot. For the data you arbitrarily chose amenities_count and corrected_price, but for the cluster centers you took the first two coordinates, x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1], which don't correspond to those two features.

For these reasons you get strange results: the plotted centroids and the plotted points are not in the same coordinates, so their relative positions don't mean anything.

The bottom line is that you cannot fully view a 14-dimensional clustering in two dimensions.

To show point 2 more clearly, change the line that plots the cluster centers to

sns.scatterplot(x=model.cluster_centers_[:,10], y=model.cluster_centers_[:,13], color='blue',marker='*', label='centroid', s=250)

so that the cluster centers are plotted against the same features as the data.
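A minimal sketch of the same idea that looks the columns up by name instead of hard-coding their positions. The synthetic data and the three-entry feature list are assumptions for illustration only: amenities_count and corrected_price come from the question, while other_feature is a made-up placeholder for the remaining columns.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the question's feature list (the real one has 14 entries).
features_clustering = ["amenities_count", "other_feature", "corrected_price"]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(features_clustering)))  # synthetic data, illustration only

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Column positions of the two features used on the plot axes.
ix = features_clustering.index("amenities_count")
iy = features_clustering.index("corrected_price")

# Centroid coordinates in the same two features as the plotted data.
centers_x = model.cluster_centers_[:, ix]
centers_y = model.cluster_centers_[:, iy]
```

Passing centers_x and centers_y to the centroid scatterplot keeps the centroids and the data on the same axes, whatever the column order is.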

The SO answer you linked, about cluster centers lying outside the cluster data, deals with a different problem: the data was scaled to between 0 and 1 before clustering, and the cluster centers were not scaled back up when plotted with the real data. This is not the same as your issue here.
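For reference, that scaling problem can be sketched as follows. This is only an illustration on synthetic data, and the choice of MinMaxScaler is an assumption, since the linked answer's code is not shown here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic two-feature data on very different scales (illustration only).
X = np.column_stack([rng.uniform(0, 30, 200), rng.uniform(50, 500, 200)])

scaler = MinMaxScaler()  # rescales each feature to the range [0, 1]
X_scaled = scaler.fit_transform(X)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# cluster_centers_ live in the scaled space, so map them back to the
# original units before plotting them on top of the unscaled data.
centers_original = scaler.inverse_transform(model.cluster_centers_)
```

Plotting model.cluster_centers_ directly against X would place every centroid between 0 and 1 on both axes, far away from the data.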

CodePudding user response：

You are building clusters in many dimensions and want them to fit a two-dimensional map; by itself that will not work. Let me explain. Each variable is a dimension: x1,x2,x3,...,xn. When you find the clusters, each center comes back with one coordinate per variable: y1,y2,y3,...,yn. When you map the result in 2D as you are doing (I take your example), x1 is "amenities_count" and x5 is "corrected_price".

That creates a 2D map of only these two variables, while the cluster centers are drawn using only their first two coordinates, y1 and y2. Note that y1 and y2 are not necessarily the centroid coordinates for the two variables you plotted.

You must either 1) convert the cluster centers to the corresponding x,y coordinates of your plot, or 2) reduce the dimensionality of the data to generate a 2D map that carries information from all the variables.

For the first case, I am not very sure because I have never done it (remapping the data). For dimensionality reduction, I recommend t-SNE (https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) or the classic PCA.
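A minimal sketch of the PCA route, assuming scikit-learn and synthetic data in place of the real 14 features. The key point is that the data and the cluster centers go through the same fitted projection.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 14))  # synthetic stand-in for the 14 features

# Cluster in the full 14-dimensional space.
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Fit the projection on the data, then apply it to both the data and the centers.
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)
centers_2d = pca.transform(model.cluster_centers_)
```

X_2d (colored by model.labels_) and centers_2d can then be drawn on the same 2D axes. With scikit-learn's TSNE this two-step approach is not available, because that estimator only offers fit_transform and has no transform for new points, so the centers would have to be embedded together with the data.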

Moral: if you want to see your clusters in 2D, make sure you only have 2 variables.