I have a pandas DataFrame like the one below, where the index is a pd.DatetimeIndex and each column is a time series.
| | x_1 | x_2 | x_3 |
|---|---|---|---|
| 2020-08-17 | 133.23 | 2457.45 | -4676 |
| 2020-08-18 | -982 | -6354.56 | -245.657 |
| 2020-08-19 | 5678.642 | 245.2786 | 2461.785 |
| 2020-08-20 | -2394 | 154.34 | -735.653 |
| 2020-08-20 | 236 | -8876 | -698.245 |
I need to calculate the Euclidean distance between every pair of columns, i.e. between x_1 and x_2, x_1 and x_3, and x_2 and x_3, and return a square data frame like the one below. (The values in this table are only placeholders, not the actual Euclidean distances.)
| | x_1 | x_2 | x_3 |
|---|---|---|---|
| x_1 | 0 | 123 | 456 |
| x_2 | 123 | 0 | 789 |
| x_3 | 456 | 789 | 0 |
I tried this resource, but I could not figure out how to pass the columns of my df. If I understand correctly, the example passes the rows as the series to calculate the Euclidean distance from.
CodePudding user response:
An explicit way of achieving this would be:
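from itertools import combinations

import numpy as np
import pandas as pd

# Square frame indexed by column name; float dtype keeps the distances numeric
dist_df = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)

# Visit each unordered pair of columns once and fill both symmetric cells
for col_a, col_b in combinations(df.columns, 2):
    dist = np.linalg.norm(df[col_a] - df[col_b])
    dist_df.loc[col_a, col_b] = dist
    dist_df.loc[col_b, col_a] = dist

print(dist_df)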
outputs
              x_1           x_2           x_3
x_1           NaN  12381.858429   6135.306973
x_2  12381.858429           NaN  12680.121047
x_3   6135.306973  12680.121047           NaN
If you want 0 instead of NaN, use DataFrame.fillna:
dist_df.fillna(0, inplace=True)
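For comparison, a routine that measures distances between rows can be pointed at the columns by transposing the frame first. The sketch below assumes SciPy is installed and uses scipy.spatial.distance.pdist together with squareform; it is one possible alternative, not necessarily the resource linked in the question. The data is copied from the question's table, and the datetime index is left out because it plays no role in the distances.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Values copied from the question's table (index omitted for brevity)
df = pd.DataFrame(
    {
        "x_1": [133.23, -982, 5678.642, -2394, 236],
        "x_2": [2457.45, -6354.56, 245.2786, 154.34, -8876],
        "x_3": [-4676, -245.657, 2461.785, -735.653, -698.245],
    }
)

# pdist compares the rows of its input, so transpose to turn columns into rows
condensed = pdist(df.T.to_numpy(), metric="euclidean")

# squareform expands the condensed vector into a symmetric matrix
# whose diagonal is zero
dist_df = pd.DataFrame(squareform(condensed), index=df.columns, columns=df.columns)
print(dist_df)
```

With this variant the diagonal is already 0, so no fillna step is needed.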
CodePudding user response:
The following code works with any number of columns.
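setup (note: the last x3 value here is 698.245, while the question's table has -698.245, so the distances involving x3 differ from the previous answer's output)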
import pandas as pd

df = pd.DataFrame(
    {
        "x1": [133.23, -982, 5678.642, -2394, 236],
        "x2": [2457.45, -6354.56, 245.2786, 154.34, -8876],
        "x3": [-4676, -245.657, 2461.785, -735.653, 698.245],
    }
)
solution
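import numpy as np
# aux stacks n_cols copies of the data: aux[k, i, j] == df.values[i, j]
aux = np.broadcast_to(df.values, (df.shape[1], *df.shape))
# aux.transpose()[k, i, j] == df.values[i, k]; sum squared differences over rows
result = np.sqrt(np.square(aux - aux.transpose()).sum(axis=1))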
result is a numpy.ndarray. You can wrap it in a DataFrame like this:
pd.DataFrame(result, columns=df.columns, index=df.columns)
              x1            x2            x3
x1      0.000000  12381.858429   6081.352512
x2  12381.858429      0.000000  13622.626775
x3   6081.352512  13622.626775      0.000000
A full explanation of why this approach works is beyond the scope of this answer; in short, it relies on NumPy broadcasting to form every pairwise column difference at once. You will need to decide which matters more to you: speed, or readability and ease of understanding.
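As a rough illustration of the idea, the same computation can be written with explicit broadcasting axes, which may be easier to follow. This is only a sketch, assuming the frame is fully numeric; the names values and diff are illustrative and do not come from the code above. The distance between columns a and b is the square root of the sum over rows t of (values[t, a] - values[t, b]) squared.

```python
import numpy as np
import pandas as pd

# Same setup as in this answer
df = pd.DataFrame(
    {
        "x1": [133.23, -982, 5678.642, -2394, 236],
        "x2": [2457.45, -6354.56, 245.2786, 154.34, -8876],
        "x3": [-4676, -245.657, 2461.785, -735.653, 698.245],
    }
)

values = df.to_numpy()  # shape (n_rows, n_cols)

# Add a length-1 axis on each side so broadcasting pairs every column
# with every other column:
#   values[:, :, None] has shape (n_rows, n_cols, 1)
#   values[:, None, :] has shape (n_rows, 1, n_cols)
#   diff[t, a, b] == values[t, a] - values[t, b]
diff = values[:, :, None] - values[:, None, :]

# Sum the squared differences over the row axis, then take the square root
result = np.sqrt((diff ** 2).sum(axis=0))

dist_df = pd.DataFrame(result, index=df.columns, columns=df.columns)
print(dist_df)
```

The intermediate diff array holds n_rows * n_cols * n_cols values, so for frames with many columns the explicit loop over column pairs may be the more memory-friendly choice.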