Find distance between rows in pandas dataframe but with reference to 1 row

In this pandas dataframe:

y_train     feat1   feat2
0   9.596113    -7.900107
1   -1.384157   2.685313
2   -8.211954   5.214797

How do I add a "distance from Class 0" column at the end of the dataframe that gives, for each class (i.e. each row), its distance from the y_train=0 row? I want to use class 0 as the reference. In this dataframe, feat1 is the x coordinate and feat2 is the y coordinate.

I tried:

from sklearn.metrics import pairwise_distances

pairwise_distances(df_centroid['feat1'].values, df_centroid['feat2'].values)

but that gave me an error:

ValueError: Expected 2D array, got 1D array instead:

Any help will be greatly appreciated!

Thanks!

Answer:

pairwise_distances takes a first input X (all the points) and a second input Y (the points we want to compute the distance to). Both must be 2D arrays of shape (n_samples, n_features), which is why passing two 1D columns raises the error above.

For X we have all the classes. Each feature is one coordinate of a class's location; in mathematical terms, each class is a vector f = [f₀, f₁], where each f_i is a feature value.

For Y we have Class 0: we want the distance from every row of X to Class 0.
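As a sanity check on what "distance to Class 0" means here, the same numbers can be computed by hand with NumPy. This is only a sketch and assumes plain Euclidean distance, which is the default metric of pairwise_distances; the array X below is just the feature values from the question typed in directly.

```python
import numpy as np

# Feature values from the question: one row per class, columns are feat1, feat2.
X = np.array([
    [9.596113, -7.900107],
    [-1.384157, 2.685313],
    [-8.211954, 5.214797],
])

# Reference point: class 0, i.e. the first row.
ref = X[0]

# Euclidean distance of every row to the reference row:
# sqrt((x - x0)^2 + (y - y0)^2), computed row by row.
dist_to_class0 = np.linalg.norm(X - ref, axis=1)
print(dist_to_class0)
```

Broadcasting subtracts the reference row from every row of X, and axis=1 takes the norm across the two feature columns, so the result has one distance per class.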

Experimenting a bit, we can see that:

import pandas as pd
import numpy as np
import io

df = pd.read_table(io.StringIO("""
y_train     feat1   feat2
0   9.596113    -7.900107
1   -1.384157   2.685313
2   -8.211954   5.214797
"""), sep=r"\s+")

df[['feat1', 'feat2']].to_numpy()

array([[ 9.596113, -7.900107],
       [-1.384157,  2.685313],
       [-8.211954,  5.214797]])

from sklearn.metrics import pairwise_distances

pairwise_distances(df[['feat1', 'feat2']].to_numpy(), [[ 9.596113, -7.900107]])

# Output
array([[ 0.        ],
       [15.2518014 ],
       [22.11623741]])

Aha, so let's go ahead and do this properly.

We need an ndim=2 array for both inputs to pairwise_distances, which is why I use a slice in .loc, i.e. 0:0: it returns a one-row DataFrame, whereas .loc[0] would return a 1D Series. (The .to_numpy() conversion happens automatically when you pass a DataFrame, but remember to think about how pairwise_distances would handle missing data.)

df['distance'] = pairwise_distances(df[['feat1', 'feat2']],
                                    df.loc[0:0, ['feat1', 'feat2']])
df

   y_train     feat1     feat2   distance
0        0  9.596113 -7.900107   0.000000
1        1 -1.384157  2.685313  15.251801
2        2 -8.211954  5.214797  22.116237
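To see why the 0:0 slice matters, it can help to compare the shapes directly. The sketch below rebuilds the dataframe from the question and assumes the default integer index; the variable names (cols, row_as_frame and so on) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "y_train": [0, 1, 2],
    "feat1": [9.596113, -1.384157, -8.211954],
    "feat2": [-7.900107, 2.685313, 5.214797],
})
cols = ["feat1", "feat2"]

# A label slice keeps the row dimension: one-row DataFrame, 2D.
row_as_frame = df.loc[0:0, cols]
# A single label drops it: Series, 1D (this is what triggers the ValueError).
row_as_series = df.loc[0, cols]

print(row_as_frame.shape)
print(row_as_series.shape)

# Two other ways to get a 2D reference row:
# reshape the 1D values into a single row ...
row_reshaped = row_as_series.to_numpy(dtype=float).reshape(1, -1)
# ... or select by class label instead of by index position.
row_by_label = df.loc[df["y_train"] == 0, cols]
```

Selecting with a boolean mask on y_train is probably the safer choice if the index is not guaranteed to line up with the class labels.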

I've just used the default metric (Euclidean) of the function you mentioned in your question. Now that you see how it can be used, you're free to replace it with another metric via the metric argument. The sklearn API is quite flexible w.r.t. interchangeable algorithms in this way.
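As a rough illustration of swapping the metric, the sketch below computes both Euclidean and Manhattan distances to class 0. The column names dist_euclidean and dist_manhattan are made up for this example, and .ravel() is used only to flatten the (n, 1) result into a plain 1D column.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

df = pd.DataFrame({
    "y_train": [0, 1, 2],
    "feat1": [9.596113, -1.384157, -8.211954],
    "feat2": [-7.900107, 2.685313, 5.214797],
})
cols = ["feat1", "feat2"]

# Reference row (class 0), kept 2D via the 0:0 slice.
ref = df.loc[0:0, cols]

# Same call, different metric string; each result has shape (3, 1).
df["dist_euclidean"] = pairwise_distances(df[cols], ref, metric="euclidean").ravel()
df["dist_manhattan"] = pairwise_distances(df[cols], ref, metric="manhattan").ravel()
print(df)
```

Which metric is appropriate depends on what the features represent; for x/y coordinates, Euclidean is the usual choice.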