I could use some advice on how to make my code faster for the following problem. I want to calculate the correlation between points in space (X, Y, Z), where for each point I have velocity data over time, and ideally I would like to calculate, for each point P1, the velocity correlation with all other points.
In the end I would like a matrix where, for each pair of coordinates (X1, Y1, Z1), (X2, Y2, Z2), I get the Pearson correlation coefficient. I'm not entirely sure how to best organize this in Python. What I have done so far is define lines of points in different directions and, for each line, calculate the correlation between its points. This works for the analysis, but I end up with loops that take a very long time to execute, and I think it would be better to just calculate the correlation between all points directly. Right now I'm using a pandas DataFrame and scipy.stats for the correlation (stats.pearsonr(point_X_time.Vx, point_Y_time.Vx)), which works, but I don't know how to parallelize it efficiently.
I now have all the data in a DataFrame whose head looks like:
     Velocity      X  Y      Z   Time
0  -12.125850  2.036  0  1.172  10.42
1  -12.516033  2.036  0  1.164  10.42
2  -11.816067  2.028  0  1.172  10.42
3  -10.722124  2.020  0  1.180  10.42
4  -10.628474  2.012  0  1.188  10.42
and the number of rows is ~300,000, but this could easily be increased if the code were faster.
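For reference, a minimal sketch of the per-pair approach described above (the helper name point_series and the two example coordinates, taken from the head shown, are only illustrative):

from scipy import stats

def point_series(df, point):
    # velocity time series at one (X, Y, Z) point, ordered by time
    x, y, z = point
    return df[(df.X == x) & (df.Y == y) & (df.Z == z)].sort_values("Time")["Velocity"]

r, p = stats.pearsonr(point_series(df, (2.036, 0, 1.172)),
                      point_series(df, (2.036, 0, 1.164)))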
CodePudding user response:
Solution 1:
groups = df.groupby(["X", "Y", "Z"])
You group the data by the points in space.
Then you iterate over all combinations of points and calculate the correlation:
import itertools
import numpy as np

for pair in itertools.combinations(groups.groups.keys(), 2):
    # velocity time series at each of the two points
    first = groups.get_group(pair[0])["Velocity"]
    second = groups.get_group(pair[1])["Velocity"]
    # only compare points that have the same number of time samples
    if len(first) == len(second):
        print(f"{pair} {np.corrcoef(first, second)[0, 1]:.2f}")
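If you want the result as data rather than printed lines, the same loop can fill a dictionary keyed by the point pair (a sketch under the same equal-length assumption; corr_series is an illustrative name):

import itertools
import numpy as np
import pandas as pd

corr = {}
for a, b in itertools.combinations(groups.groups.keys(), 2):
    first = groups.get_group(a)["Velocity"].to_numpy()
    second = groups.get_group(b)["Velocity"].to_numpy()
    if len(first) == len(second):
        corr[(a, b)] = np.corrcoef(first, second)[0, 1]

corr_series = pd.Series(corr)  # index is ((X1, Y1, Z1), (X2, Y2, Z2))

Note that this still loops over all ~N²/2 point pairs in Python, so it is not faster than the print version, only more convenient to post-process.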
Solution 2:
df["cc"] = df.groupby(["X", "Y", "Z"]).cumcount()
df.set_index(["cc","X", "Y", "Z"])
df.unstack(level=[1,2,3])["Velocity"].corr()
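This pivots the data so that each (X, Y, Z) point becomes a column of velocities indexed by time step, and DataFrame.corr() then computes all pairwise Pearson coefficients at once (missing samples are excluded pairwise). Assuming the result is stored in corr_matrix as in the snippet above, a single pair can be looked up like this (coordinates taken from the head shown in the question):

corr_matrix.loc[(2.036, 0, 1.172), (2.036, 0, 1.164)]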