Let's say I have the following dataframe:
feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8
0 0.862874 0.392938 0.669744 0.939903 0.382574 0.780595 0.049201 0.627703
1 0.942322 0.676181 0.223476 0.102698 0.620883 0.834038 0.966355 0.554645
2 0.940375 0.310532 0.975096 0.600778 0.893220 0.282508 0.837575 0.112575
3 0.868902 0.818175 0.102860 0.936395 0.406088 0.619990 0.913905 0.597607
4 0.143344 0.207751 0.835707 0.414900 0.360534 0.525631 0.228751 0.294437
5 0.339856 0.501197 0.671033 0.302202 0.406512 0.997044 0.080621 0.068071
6 0.521056 0.343654 0.812553 0.393159 0.217987 0.247602 0.671783 0.254299
7 0.594744 0.180041 0.884603 0.578050 0.441461 0.176732 0.569595 0.391923
8 0.402864 0.062175 0.565858 0.349415 0.106725 0.323310 0.153594 0.277930
9 0.480539 0.540283 0.248376 0.252237 0.229181 0.092273 0.546501 0.201396
And I would like to find clusters in these rows. To do so, I want to use KMeans. However, I would like to find clusters while giving more importance to [feature_1, feature_2] than to the other features in the dataframe. Let's say an importance coefficient of 0.5 for [feature_1, feature_2], and 0.5 for the remaining features.
I thought about transforming [feature_3, ..., feature_8] into a single column by using PCA. By doing so, I imagine that KMeans would give less importance to a single feature than to 6 separate features.
Is this a good idea? Do you see better ways of giving this information to the algorithm?
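For reference, the PCA idea described above could be sketched as follows. This is a minimal illustration, using randomly generated data in place of the table shown earlier, and assuming 2 clusters just for the example:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in for the dataframe above: 10 rows, 8 random features
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 8)),
                  columns=[f"feature_{i}" for i in range(1, 9)])

# Compress feature_3..feature_8 into a single principal component
pca = PCA(n_components=1)
compressed = pca.fit_transform(df[[f"feature_{i}" for i in range(3, 9)]])

# Cluster on feature_1, feature_2 plus the single compressed column,
# so the 6 compressed features count as one dimension in the distance
X = np.column_stack([df[["feature_1", "feature_2"]].to_numpy(), compressed])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```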
CodePudding user response:
What KMeans does is try to find centroids and assign each point to the centroid with the smallest Euclidean distance. When minimizing Euclidean distances, or using them as loss functions in machine learning, one should in general make sure that the different features have the same scale; otherwise the features with larger values would dominate the distance. That's why we normally do some scaling before training our models.
However, in your case, you can make use of exactly that effect: first bring all features onto the same scale with a MinMaxScaler or StandardScaler, and then either scale up the first 2 features by a factor > 1 or scale down the remaining 6 features by a factor < 1.
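A minimal sketch of that approach, assuming random data in place of the original dataframe, 2 clusters, and one particular reading of the 0.5/0.5 split (each group of features contributes half of the total weight):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Stand-in for the dataframe above: 10 rows, 8 random features
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 8)),
                  columns=[f"feature_{i}" for i in range(1, 9)])

# 1) Bring every feature onto the same scale
X = MinMaxScaler().fit_transform(df)

# 2) Re-weight the columns: the 2 important features share a total
#    weight of 0.5 (0.25 each), the other 6 share the remaining 0.5
weights = np.array([0.5 / 2] * 2 + [0.5 / 6] * 6)
X_weighted = X * weights

# 3) Run KMeans on the weighted data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_weighted)
```

Note that multiplying a column by a factor w scales its contribution to the squared Euclidean distance by w², so you may want to tune the factors empirically rather than treating them as exact importances.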