I'm following a machine learning course with only basic knowledge of Python. While working through an example on Towards Data Science about K-means clustering (Source), I came across a way of indexing that I don't understand and didn't ask the professor about during the lecture. In the part where the clusters are plotted along with the centroids, the author uses indexing like:
plt.scatter(
X[y_km == 2, 0], X[y_km == 2, 1],
s=50, c='lightblue',
marker='v', edgecolor='black',
label='cluster 3'
)
Does anybody know how this works?
I've tried running the indexing expression outside of plt.scatter, but that didn't tell me anything beyond what I already know.
CodePudding user response:
Here is an article that can help you understand ndarray indexing better: Indexing on ndarrays
In your example, X is a 2-dimensional ndarray with n rows and 2 columns: feature1 and feature2.
Simple example:
import numpy as np
x = np.arange(20).reshape(10, 2)
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15],
[16, 17],
[18, 19]])
And a simple example of y, an array of class labels:
y = np.array([1, 2] * 5)
array([1, 2, 1, 2, 1, 2, 1, 2, 1, 2])
Suppose you want all rows of the x array that correspond to class 1. You can do this with boolean array indexing:
x[y == 1]
array([[ 0, 1],
[ 4, 5],
[ 8, 9],
[12, 13],
[16, 17]])
But if you want only one column from those rows, add the column index as the second dimension:
x[y == 1, 0] # all rows of feature1 (0 index) corresponding to class 1
array([ 0, 4, 8, 12, 16])
Here y == 1 selects the rows (those whose class is 1) and 0 is the index of the column you are interested in.
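As a quick check of the explanation above, here is a minimal sketch that reuses the toy x and y from this answer (the names mask, direct and two_step are only illustrative). It suggests that indexing with the mask and a column in one step gives the same result as filtering the rows first and then taking the column:

```python
import numpy as np

x = np.arange(20).reshape(10, 2)   # 10 rows, 2 columns
y = np.array([1, 2] * 5)           # class label for each row

mask = y == 1                      # boolean array, one entry per row of x

direct = x[mask, 0]                # rows where mask is True, column 0 only
two_step = x[mask][:, 0]           # filter rows first, then take column 0

print(direct)                            # [ 0  4  8 12 16]
print(np.array_equal(direct, two_step))  # True
```

The one-step form is what the plotting code in the question uses.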
CodePudding user response:
X is an array with 2 columns. You can think of them as x and y coordinates.
By printing the first 10 rows, you see:
print(X[0:10])
[[ 2.60509732 1.22529553]
[ 0.5323772 3.31338909]
[ 0.802314 4.38196181]
[ 0.5285368 4.49723858]
[ 2.61858548 0.35769791]
[ 1.59141542 4.90497725]
[ 1.74265969 5.03846671]
[ 2.37533328 0.08918564]
[-2.12133364 2.66447408]
[ 1.72039618 5.25173192]]
y_km holds the cluster label assigned to each of these points.
In the example, each point is labelled 0, 1, or 2:
print(y_km[0:10])
[1 0 0 0 1 0 0 1 2 0]
When you evaluate y_km == 1, the comparison is applied element by element and produces an array of Booleans:
print((y_km==1)[0:10])
[ True False False False True False False True False False]
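To inspect what the comparison produces, here is a small sketch using the ten labels printed above. The array is typed in by hand, and the names labels and mask are illustrative rather than taken from the original article:

```python
import numpy as np

# First ten values of y_km as shown above (entered manually for illustration)
labels = np.array([1, 0, 0, 0, 1, 0, 0, 1, 2, 0])

mask = labels == 1            # element-wise comparison, one Boolean per label
print(mask.dtype)             # bool
print(mask.sum())             # 3  (each True counts as 1)
print(np.flatnonzero(mask))   # [0 4 7]  positions where the label is 1
```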
So when you write
X[y_km == 1, 1]
you are asking for the rows of X where y_km equals 1, and from those rows only column 1. It grabs only the rows for which y_km == 1 is True, and only the value from the specified column (i.e. 1).
Likewise,
X[y_km == 2, 0]
selects the rows where y_km equals 2 and returns column 0 of the X array for those rows.
So the first index is a Boolean mask that picks the rows belonging to the cluster you want, and the second index is the column of the X array to retrieve from those rows.
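Putting this together, the sketch below splits the ten sample rows shown earlier into per-cluster coordinate arrays, which is roughly what each plt.scatter call in the question receives. The data is copied by hand from the printed output, the dictionary name coords is hypothetical, and the plotting call is left as a comment so the example depends only on NumPy:

```python
import numpy as np

# First ten rows of X and y_km as printed above (entered manually)
X = np.array([
    [ 2.60509732, 1.22529553],
    [ 0.5323772,  3.31338909],
    [ 0.802314,   4.38196181],
    [ 0.5285368,  4.49723858],
    [ 2.61858548, 0.35769791],
    [ 1.59141542, 4.90497725],
    [ 1.74265969, 5.03846671],
    [ 2.37533328, 0.08918564],
    [-2.12133364, 2.66447408],
    [ 1.72039618, 5.25173192],
])
y_km = np.array([1, 0, 0, 0, 1, 0, 0, 1, 2, 0])

coords = {}
for k in np.unique(y_km):
    xs = X[y_km == k, 0]   # column 0 of the rows in cluster k
    ys = X[y_km == k, 1]   # column 1 of the same rows
    coords[int(k)] = (xs, ys)
    # With matplotlib available, plt.scatter(xs, ys, label=f'cluster {k + 1}')
    # would draw this cluster, as in the question.

for k, (xs, ys) in coords.items():
    print(k, len(xs), "points")
```

Looping over np.unique(y_km) avoids repeating the same plt.scatter call once per cluster, though writing the calls out separately, as the article does, allows a different marker and colour for each.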