What do these commands do in the digits dataset clustering demonstration?

I have been working through a Python tutorial here on fitting k-means clustering to the digits dataset, and some of the code is confusing me.

I do understand this part where we need to train our model using 10 clusters.

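from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
digits.data.shape  # (1797, 64): 1797 samples, 64 pixel features each

# Fit k-means with 10 clusters and get a cluster index for every sample
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape  # (10, 64): one 64-feature centroid per cluster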

The following displays the 10 cluster centroids. It first creates a figure with two rows of five subplots each; plt.subplots returns the figure and the axes, and (8, 3) is the size of the figure. But after that, I do not understand how the for loop displays the cluster centroids.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

Also, this part checks how accurately the clustering found similar digits in the data. I know that we need to create a labels array of zeros with the same size as clusters so we can place our predicted labels in it. But again, I do not understand how this is implemented inside the for loop.

import numpy as np
from scipy.stats import mode

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

Can someone please explain what each line does? Thank you.

CodePudding user response:

Question 1: How does the code plot the centroids?

It's important to see that each centroid is a point in the feature space. In other words, a centroid has the same form as one of the training samples. In this case, each training sample is an 8 × 8 image, although the images have been flattened into rows of 64 elements because sklearn expects the input X to be a two-dimensional array. So each centroid also represents an 8 × 8 image.
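To make the flattening concrete, here is a small sketch using NumPy only. The array image is a made-up stand-in for one digit, not real data from the dataset; it just shows that an 8 × 8 array and a row of 64 values hold the same numbers in a different shape.

```python
import numpy as np

# Stand-in for one 8 x 8 digit image (values are made up for illustration).
image = np.arange(64).reshape(8, 8)

# Flatten it into a row of 64 features, the form the estimator is given.
row = image.reshape(-1)
print(row.shape)       # (64,)

# A centroid lives in the same 64-dimensional space,
# so it can be reshaped back to 8 x 8 in the same way.
restored = row.reshape(8, 8)
print(restored.shape)  # (8, 8)
print(np.array_equal(image, restored))  # True
```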

The loop steps over the axes (a 2 × 5 array, which ax.flat walks through one Axes at a time) and the centroids (kmeans.cluster_centers_) together. The purpose of zip is to pair each Axes object with a corresponding center (this is a common way to plot n things into n subplots). The centroids have been reshaped into a 10 × 8 × 8 array, so that each of the 10 centroids is the 8 × 8 image we're expecting.

Since each centroid is now a 2D array, you can use imshow to plot it.
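If it helps, the pairing can be inspected on its own. The sketch below assumes a random array, fake_centers, in place of kmeans.cluster_centers_ so that it runs without fitting anything; the name and the values are hypothetical. The Agg backend line is only there so it also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for kmeans.cluster_centers_: 10 rows of 64 features.
rng = np.random.default_rng(0)
fake_centers = rng.random((10, 64))

fig, ax = plt.subplots(2, 5, figsize=(8, 3))
print(ax.shape)  # (2, 5)

# Turn each 64-element row back into an 8 x 8 image.
centers = fake_centers.reshape(10, 8, 8)

# ax.flat walks the 2 x 5 grid row by row; zip pairs each Axes with one image.
pairs = list(zip(ax.flat, centers))
for axi, center in pairs:
    axi.set(xticks=[], yticks=[])  # hide the tick marks
    axi.imshow(center, interpolation="nearest", cmap=plt.cm.binary)

print(len(pairs))         # 10
print(pairs[0][1].shape)  # (8, 8)
```

Iterating over centers yields one 8 × 8 slice at a time, which is why each call to imshow receives a 2D array.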

Question 2: How does the code assign labels?

The easiest thing might be to take the code apart and run bits of it on their own. For example, take a look at clusters == 0. This is a Boolean array that is True wherever a sample was assigned to cluster 0. You can use Boolean arrays to index other arrays of the same shape. The first line of code in the loop assigns this array to mask so we can use it.
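Here is a tiny version you can run to see the masking. The clusters and target arrays below are made-up values for six samples, not the real digits data.

```python
import numpy as np

# Made-up cluster assignments for six samples (illustration only).
clusters = np.array([0, 2, 0, 1, 2, 0])

# True wherever the sample belongs to cluster 0.
mask = (clusters == 0)
print(mask)  # [ True False  True False False  True]

# Hypothetical true labels for the same six samples.
target = np.array([7, 3, 7, 5, 3, 1])

# Indexing with the mask keeps only the entries where mask is True.
print(target[mask])  # [7 7 1]

# Assigning through the mask changes only those positions.
labels = np.zeros_like(clusters)
labels[mask] = 7
print(labels)  # [7 0 7 0 0 7]
```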

Then we index into labels using the Boolean array (try it!) to say, "Change these values to the mode (the most common value) of the corresponding elements of the label vector, i.e. digits.target." The index [0] is needed because scipy.stats.mode() returns both the mode and its count, and we only want the mode (again, try it out).
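A quick way to try mode() on its own is sketched below. The members array is a made-up set of true labels for one cluster. Whether the mode comes back as a scalar or as a length-1 array depends on the SciPy version, so the sketch uses np.ravel to cope with both; that extra step is my addition and is not in the tutorial code.

```python
import numpy as np
from scipy.stats import mode

# Hypothetical true labels of the samples that fell into one cluster.
members = np.array([7, 7, 1, 7, 9])

result = mode(members)
# result holds two fields, the mode and its count; [0] selects the mode.
# Depending on the SciPy version it is a scalar or a length-1 array,
# so np.ravel is used here to handle both cases.
most_common = int(np.ravel(result[0])[0])
print(most_common)  # 7
```

In the loop, that value is written into every position of labels where mask is True, so each cluster ends up labelled with the digit that occurs most often among its members.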