Euclidean distance between pandas columns that represent timeseries?

I have a pandas data frame like this, where the index is a pd.DatetimeIndex and each column is a time series.

	x_1	x_2	x_3
2020-08-17	133.23	2457.45	-4676
2020-08-18	-982	-6354.56	-245.657
2020-08-19	5678.642	245.2786	2461.785
2020-08-20	-2394	154.34	-735.653
2020-08-20	236	-8876	-698.245

I need to calculate the Euclidean distance between every pair of columns, i.e., (x_1 - x_2), (x_1 - x_3), (x_2 - x_3), and return a square data frame like this (the values in this table are only placeholders, not the actual Euclidean distances):

	x_1	x_2	x_3
x_1	0	123	456
x_2	123	0	789
x_3	456	789	0

I tried this resource, but I could not figure out how to pass the columns of my df. If I understand correctly, the example treats the rows as the series to calculate the Euclidean distance between.

CodePudding user response：

An explicit way of achieving this would be:

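from itertools import combinations

import numpy as np
import pandas as pd

# df is the data frame from the question.
# Square frame labelled by column name on both axes; cells start as NaN.
dist_df = pd.DataFrame(index=df.columns, columns=df.columns)

# Visit each unordered pair of columns once and fill both symmetric cells.
for col_a, col_b in combinations(df.columns, 2):
    # Euclidean distance is the L2 norm of the element-wise difference.
    dist = np.linalg.norm(df[col_a] - df[col_b])
    dist_df.loc[col_a, col_b] = dist
    dist_df.loc[col_b, col_a] = dist

print(dist_df)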

outputs

              x_1           x_2           x_3
x_1           NaN  12381.858429   6135.306973
x_2  12381.858429           NaN  12680.121047
x_3   6135.306973  12680.121047           NaN

If you want 0 instead of NaN, use DataFrame.fillna:

dist_df.fillna(0, inplace=True)
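As a variant sketch of the loop above (not part of the original answer): if the frame is initialised with 0.0 instead of being left empty, the diagonal is already zero and the values are stored as floats, so no fillna step is needed. The data below are copied from the question; the default integer index stands in for the dates, which do not affect the distances.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Sample data copied from the question (dates omitted; they do not affect the result)
df = pd.DataFrame(
    {
        "x_1": [133.23, -982, 5678.642, -2394, 236],
        "x_2": [2457.45, -6354.56, 245.2786, 154.34, -8876],
        "x_3": [-4676, -245.657, 2461.785, -735.653, -698.245],
    }
)

# Start from a float frame of zeros so the diagonal needs no later fix-up
dist_df = pd.DataFrame(0.0, index=df.columns, columns=df.columns)

# Fill both symmetric cells for each unordered pair of columns
for col_a, col_b in combinations(df.columns, 2):
    dist = np.linalg.norm(df[col_a] - df[col_b])
    dist_df.loc[col_a, col_b] = dist
    dist_df.loc[col_b, col_a] = dist

print(dist_df)
```

The off-diagonal values should match the output shown above, with 0.0 where that output has NaN.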

CodePudding user response：

The following code works with any number of columns.

setup

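import pandas as pd

# Note: the last x3 value here is 698.245, whereas the question lists -698.245,
# so the x3 distances below differ from those computed on the question's data.
df = pd.DataFrame(
    {
        "x1":[133.23, -982, 5678.642, -2394, 236],
        "x2":[2457.45, -6354.56, 245.2786, 154.34, -8876],
        "x3":[-4676, -245.657, 2461.785, -735.653, 698.245],
    }
)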

solution

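import numpy as np

# Stack n_cols read-only views of the data: shape (n_cols, n_rows, n_cols)
aux = np.broadcast_to(df.values, (df.shape[1], *df.shape))
# (aux - aux.transpose())[i, t, k] is row t of column k minus row t of column i;
# summing the squares over the row axis (axis=1) gives the squared distances
result = np.sqrt(np.square(aux - aux.transpose()).sum(axis=1))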

result is a numpy.ndarray.

You can wrap it in a DataFrame if you wish, like this:

pd.DataFrame(result, columns=df.columns, index=df.columns)

              x1            x2            x3
x1      0.000000  12381.858429   6081.352512
x2  12381.858429      0.000000  13622.626775
x3   6081.352512  13622.626775      0.000000

Explaining why this approach works is beyond what I'm willing to go into here. You will need to decide which matters more to you: speed or readability.
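For readers who want some intuition, here is a hedged sketch (an illustration added here, not the answerer's own explanation). The broadcast builds, for every pair of columns, the row-by-row differences between the two columns; squaring those, summing over the rows, and taking the square root is the definition of the Euclidean distance. The same idea can be written with explicit new axes, which some may find easier to follow. The helper name pairwise_column_distances is hypothetical.

```python
import numpy as np
import pandas as pd


def pairwise_column_distances(df: pd.DataFrame) -> pd.DataFrame:
    """Euclidean distance between every pair of columns (hypothetical helper)."""
    vals = df.to_numpy(dtype=float)             # shape (n_rows, n_cols)
    # diff[t, i, k] = vals[t, i] - vals[t, k] for every row t and column pair (i, k)
    diff = vals[:, :, None] - vals[:, None, :]  # shape (n_rows, n_cols, n_cols)
    dist = np.sqrt((diff ** 2).sum(axis=0))     # sum the squares over the rows
    return pd.DataFrame(dist, index=df.columns, columns=df.columns)


# Setup data from this answer
df = pd.DataFrame(
    {
        "x1": [133.23, -982, 5678.642, -2394, 236],
        "x2": [2457.45, -6354.56, 245.2786, 154.34, -8876],
        "x3": [-4676, -245.657, 2461.785, -735.653, 698.245],
    }
)

result_df = pairwise_column_distances(df)

# Cross-check against the broadcast_to/transpose formulation from the answer
aux = np.broadcast_to(df.values, (df.shape[1], *df.shape))
reference = np.sqrt(np.square(aux - aux.transpose()).sum(axis=1))
print(np.allclose(result_df.to_numpy(), reference))
```

Both formulations allocate an intermediate array with n_rows * n_cols * n_cols elements, so for frames with many columns an explicit loop over column pairs may use far less memory.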