How does indexing with a comma work in Python's plt?

I am taking a Machine Learning course with only basic knowledge of Python. While following a Towards Data Science example on K-means clustering, I came across a way of indexing that I didn't get to ask the professor about during the lecture. In the part where the graph is plotted with the centroids, the author uses indexing like:

plt.scatter(
    X[y_km == 2, 0], X[y_km == 2, 1],
    s=50, c='lightblue',
    marker='v', edgecolor='black',
    label='cluster 3'
)

Does anybody know how this works?

I've tried running it outside of plt.scatter, but that didn't tell me anything beyond what I already know.

CodePudding user response：

Here is an article that can help you understand ndarray indexing better: Indexing on ndarrays

So in your example X is a 2-D ndarray with n rows and 2 columns: feature1 and feature2. A simple example:

import numpy as np

x = np.arange(20).reshape(10, 2)

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15],
       [16, 17],
       [18, 19]])

And a simple example of y, an array of class labels:

y = np.array([1, 2] * 5)
array([1, 2, 1, 2, 1, 2, 1, 2, 1, 2])

Suppose you want all rows of x that correspond to class 1. You can do this with boolean array indexing:

x[y == 1]

array([[ 0,  1],
       [ 4,  5],
       [ 8,  9],
       [12, 13],
       [16, 17]])
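To see what the mask does on its own, here is a small sketch using the same x and y (the names mask and rows are only illustrative). The comparison produces one boolean per row, and indexing with it keeps the rows marked True:

```python
import numpy as np

x = np.arange(20).reshape(10, 2)
y = np.array([1, 2] * 5)

mask = y == 1                 # boolean array, one entry per row of x
rows = np.flatnonzero(mask)   # integer positions where the mask is True

print(mask.shape)             # (10,)
print(rows)                   # [0 2 4 6 8]

# Indexing with the mask or with the integer positions selects the same rows.
assert np.array_equal(x[mask], x[rows])
```

Note that the mask must have as many entries as x has rows, otherwise NumPy raises an IndexError.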

But if you want only one column of those rows, index both dimensions at once, separated by a comma:

x[y == 1, 0] # all rows of feature1 (0 index) corresponding to class 1

array([ 0,  4,  8, 12, 16])

So here y == 1 selects the rows (those where the comparison is True) and 0 is the index of the column you are interested in.
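One way to convince yourself of what the comma does is to compare it with doing the selection in two steps. The sketch below reuses the same x and y; the variable names are only for illustration. In Python, the comma inside the brackets simply builds a tuple, and NumPy applies one element of that tuple per dimension:

```python
import numpy as np

x = np.arange(20).reshape(10, 2)
y = np.array([1, 2] * 5)

one_step = x[y == 1, 0]        # rows and column in a single index
two_steps = x[y == 1][:, 0]    # rows first, then column 0 of the result
as_tuple = x[(y == 1, 0)]      # the comma is shorthand for this tuple

print(one_step)                # [ 0  4  8 12 16]
assert np.array_equal(one_step, two_steps)
assert np.array_equal(one_step, as_tuple)
```

All three forms give the same values here. The single-index form is the one usually preferred, since it avoids building the intermediate array.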

CodePudding user response：

X is an array of 2 columns. You can think of them as x and y coordinates.
By printing the first 10 rows, you see:

print(X[0:10])

[[ 2.60509732  1.22529553]
 [ 0.5323772   3.31338909]
 [ 0.802314    4.38196181]
 [ 0.5285368   4.49723858]
 [ 2.61858548  0.35769791]
 [ 1.59141542  4.90497725]
 [ 1.74265969  5.03846671]
 [ 2.37533328  0.08918564]
 [-2.12133364  2.66447408]
 [ 1.72039618  5.25173192]]

y_km is the classification of these coordinates.
In the example, each point is classified as 0, 1, or 2:

print(y_km[0:10])

[1 0 0 0 1 0 0 1 2 0]

But when you write y_km == 1, the labels are converted to an array of Booleans:

print((y_km==1)[0:10])

[ True False False False  True False False  True False False]

So when you call

X[y_km == 1 , 1]

Essentially, you are asking for the rows of X where y_km is equal to 1, and from those rows only the value in column 1. It grabs only the rows for which y_km == 1 is True, and only the value from the column specified (i.e. 1).

And

X[y_km == 2, 0]

This selects the rows where y_km is equal to 2 and returns their values from column 0 of the X array.

So the part before the comma is a Boolean mask that picks the rows of the cluster you want, and the part after the comma is the column of the X array that you want to retrieve.
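Putting this together, here is a minimal sketch that pulls out both columns for every cluster in a loop. The X and y_km below are small made-up stand-ins, not the data from the article, and coords is a hypothetical helper dictionary:

```python
import numpy as np

# Made-up stand-ins for the question's data: 6 points, 3 clusters.
X = np.array([[2.6, 1.2], [0.5, 3.3], [0.8, 4.4],
              [2.6, 0.4], [-2.1, 2.7], [1.7, 5.3]])
y_km = np.array([1, 0, 0, 1, 2, 0])

coords = {}
for label in np.unique(y_km):
    xs = X[y_km == label, 0]   # column 0 of the rows in this cluster
    ys = X[y_km == label, 1]   # column 1 of the same rows
    coords[int(label)] = (xs, ys)
    print(label, xs, ys)
```

In the question, each such pair of arrays is what gets passed as the first two arguments of plt.scatter, with one call per cluster.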