I am using pandas for this work.
I have a 2 datasets. The first dataset has approximately 6 million rows and 6 columns. For example the first data set looks something like this:
Date | Time | U | V | W | T |
---|---|---|---|---|---|
2020-12-30 | 2:34 | 3 | 4 | 5 | 7 |
2020-12-30 | 2:35 | 2 | 3 | 6 | 5 |
2020-12-30 | 2:36 | 1 | 5 | 8 | 5 |
2020-12-30 | 2:37 | 2 | 3 | 0 | 8 |
2020-12-30 | 2:38 | 4 | 4 | 5 | 7 |
2020-12-30 | 2:39 | 5 | 6 | 5 | 9 |
this is just the raw data collected from the machine.
The second is the average values of three rows at a time from each column (U,V,W,T).
U | V | W | T |
---|---|---|---|
2 | 4 | 6.33 | 5.67 |
3.66 | 4.33 | 3.33 | 8 |
What I am trying to do is calculate the perturbation for each column per second.
U(perturbation)=U(raw)-U(avg)
U(raw)= dataset 1
U(avg)= dataset 2
Basically take the first three rows from the first column of the first dataset and individually subtract them by the first value in the first column of the second dataset, then take the next three values from the first column of the first data set and individually subtract them by second value in the first column of the second dataset. Do the same for all three columns.
The desired final output should be as the following:
Date | Time | U | V | W | T |
---|---|---|---|---|---|
2020-12-30 | 2:34 | 1 | 0 | -1.33 | 1.33 |
2020-12-30 | 2:35 | 0 | -1 | -0.33 | -0.67 |
2020-12-30 | 2:36 | -1 | 1 | 1.67 | -0.67 |
2020-12-30 | 2:37 | -1.66 | -1.33 | -3.33 | 0 |
2020-12-30 | 2:38 | 0.34 | -0.33 | 1.67 | -1 |
2020-12-30 | 2:39 | 1.34 | 1.67 | 1.67 | 1 |
I am new to pandas and do not know how to approach this. I hope it makes sense.
CodePudding user response:
a = df1.assign(index = df1.index // 3).merge(df2.reset_index(), on='index')
b = a.filter(regex = '_x', axis=1) - a.filter(regex = '_y', axis = 1).to_numpy()
pd.concat([a.filter(regex='^[^_] $', axis = 1), b], axis = 1)
Date Time index U_x V_x W_x T_x
0 2020-12-30 2:34 0 0.00 0.00 -1.33 1.33
1 2020-12-30 2:35 0 -1.00 -1.00 -0.33 -0.67
2 2020-12-30 2:36 0 -2.00 1.00 1.67 -0.67
3 2020-12-30 2:37 1 -1.66 -1.33 -3.33 0.00
4 2020-12-30 2:38 1 0.34 -0.33 1.67 -1.00
5 2020-12-30 2:39 1 1.34 1.67 1.67 1.00
CodePudding user response:
You can use numpy:
import numpy as np
df1[df2.columns] -= np.repeat(df2.to_numpy(), 3, axis=0)
NB. This modifies df1
in place, if you want you can make a copy first (df_final = df1.copy()
) and apply the subtraction on this copy.