In this pandas dataframe:
y_train feat1 feat2
0 9.596113 -7.900107
1 -1.384157 2.685313
2 -8.211954 5.214797
How do I add a "distance from Class 0" column at the end of the dataframe that gives, for each class (i.e. each row), the distance from the y_train=0 row? I want to use class 0 as the reference. In this dataframe, feat1 = x and feat2 = y.
I tried:
from sklearn.metrics import pairwise_distances
pairwise_distances(df_centroid['feat1'].values, df_centroid['feat2'].values)
but that gave me an error
ValueError: Expected 2D array, got 1D array instead:
Any help will be greatly appreciated!
Thanks!
CodePudding user response:
pairwise_distances takes a first input X (all the points) and a second input Y (the points we want to compute the distance to).
For X we have all the classes. Each feature is one coordinate of the class's location; in mathematical terms, the class is a vector f = [f0, f1] where each fi is a feature weight.
For Y we have Class 0, because we want the distance from each row of X to Class 0.
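To make the X/Y picture concrete, here is a minimal sketch that computes the same quantity by hand with NumPy, assuming the Euclidean distance is what is wanted. The array values are copied from the question; the names X, Y and dist are only illustrative.

```python
import numpy as np

# X: one row per class, one column per feature (values from the question).
X = np.array([[ 9.596113, -7.900107],
              [-1.384157,  2.685313],
              [-8.211954,  5.214797]])

# Y: the reference point (class 0), kept 2D with shape (1, 2).
Y = X[0:1]

# Euclidean distance from every row of X to the single row of Y.
# Broadcasting subtracts Y from each row before squaring and summing.
dist = np.sqrt(((X - Y) ** 2).sum(axis=1))
print(dist)
```

The first entry is zero because class 0 is compared with itself; the other entries should match what pairwise_distances returns for the same inputs.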
Just experimenting a bit we can see that
import pandas as pd
import numpy as np
import io
df = pd.read_table(io.StringIO("""
y_train feat1 feat2
0 9.596113 -7.900107
1 -1.384157 2.685313
2 -8.211954 5.214797
"""), sep=r"\s+")
df[['feat1', 'feat2']].to_numpy()
array([[ 9.596113, -7.900107],
[-1.384157, 2.685313],
[-8.211954, 5.214797]])
from sklearn.metrics import pairwise_distances
pairwise_distances(df[['feat1', 'feat2']].to_numpy(), [[ 9.596113, -7.900107]])
# Output
array([[ 0. ],
[15.2518014 ],
[22.11623741]])
Aha, so let's go ahead and do this properly
We want an ndim=2 array as input to pairwise_distances in both cases, which is the reason I use a 2D slice for .loc, i.e. 0:0. (The .to_numpy() conversions happen automatically, but remember to think about how pairwise_distances would handle missing data.)
df['distance'] = pairwise_distances(df[['feat1', 'feat2']],
df.loc[0:0, ['feat1', 'feat2']])
df
y_train feat1 feat2 distance
0 0 9.596113 -7.900107 0.000000
1 1 -1.384157 2.685313 15.251801
2 2 -8.211954 5.214797 22.116237
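To see why the 0:0 slice matters, the short sketch below compares the shapes that .loc returns for a slice and for a scalar label. The dataframe is rebuilt from the question's values; the last selection, by y_train value instead of by index label, is an optional variant that assumes exactly one row has y_train equal to 0.

```python
import pandas as pd

df = pd.DataFrame({
    "y_train": [0, 1, 2],
    "feat1": [9.596113, -1.384157, -8.211954],
    "feat2": [-7.900107, 2.685313, 5.214797],
})
features = ["feat1", "feat2"]

# Label slice: returns a one-row DataFrame, so the result is 2D.
ref_2d = df.loc[0:0, features]
print(ref_2d.shape)  # (1, 2)

# Scalar label: returns a Series, which is 1D and would trigger
# the "Expected 2D array" error from the question.
ref_1d = df.loc[0, features]
print(ref_1d.shape)  # (2,)

# Variant: pick the reference by class value rather than by index label.
ref_by_class = df.loc[df["y_train"] == 0, features]
print(ref_by_class.shape)  # (1, 2)
```

Selecting by class value is the safer choice if the index does not happen to coincide with y_train.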
I've just used the function you mentioned in your question, which computes the Euclidean distance by default. Now that you see how it can be used, you're free to replace it with another metric. The sklearn API is quite flexible w.r.t. interchangeable algorithms in this way.
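As an illustration of swapping the metric, the sketch below passes metric="manhattan" to pairwise_distances. The choice of Manhattan distance and the column name distance_manhattan are only examples, not something the question asked for.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

df = pd.DataFrame({
    "y_train": [0, 1, 2],
    "feat1": [9.596113, -1.384157, -8.211954],
    "feat2": [-7.900107, 2.685313, 5.214797],
})
features = ["feat1", "feat2"]

# Reference point (class 0) as a one-row, 2D selection.
reference = df.loc[0:0, features]

# Manhattan distance sums the absolute coordinate differences.
# The result has shape (3, 1); ravel() flattens it to one value per row.
df["distance_manhattan"] = pairwise_distances(
    df[features], reference, metric="manhattan"
).ravel()
print(df)
```

Any metric name accepted by pairwise_distances can be substituted in the same place.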