The algorithm for the K-means is:
- Take one centroid c(i), chosen uniformly at random from the dataset.
- Take a new Centroid c(i), choosing an instance x(i) from the dataset with the probability
D(X(i))^2/Sum(D(X(j))^2)
from j=1 to m, whereD(X(i))
is the distance between the instance and the closest centroid which is selected.
What is this parameter m
used in the summation of the probability?
CodePudding user response:
It might have been helpful to see the original formulation, but the algorithm is quite clear: during the initialization phase, for each point not used as a centroid, calculate the distance between said point and the nearest centroid, that will be the distance D(X[i]), the pick a random point in this set of points with probability weighted with D(X[i])^2
In your formulation it seems you got m points unused.
CodePudding user response:
I am not sure I understand what you have written in the question with the formula. m seems to be the number of data points in the centroid.
Then, for each data point we assign the point to the nearest of the centroids.
We repeat this process of moving the centroid and reassigning the data points until the centroid stops moving.