Here is my large panel dataset:
Date | x1 | x2 | x3 |
---|---|---|---|
2017-07-20 | 50 | 60 | Kevin |
2017-07-21 | 51 | 80 | Kevin |
2016-05-23 | 100 | 200 | Cathy |
2016-04-20 | 20 | 20 | Cathy |
2019-01-02 | 50 | 60 | Leo |
This dataset contains billions of rows. I would like to calculate the 1-day percentage difference between x1 and x2. Let t and t+1 denote today and tomorrow. I would like to calculate (x1_{t+1} - x2_t) / x2_t.
First I tried the approach that was quickest to write: I created a nested list containing the target values for each group of x3:
nested_list = []
flatten_list = []
for group in df.x3.unique():
    df_ = df[df.x3 == group]
    nested_list.append((df_.x1.shift(-1) / df_.x2) / df_.x2)
for lst in nested_list:
    for i in lst:
        flatten_list.append(i)
df["target"] = flatten_list
However, this method would literally take a year to run, which is not feasible.
I also tried the native pandas groupby method, hoping it would run in a reasonable time, but it did not seem to work:
def target_calculation(x):
    target = (x.x1.shift(-1) - x.x2) / x.x2
    return target
df["target"] = df.groupby("x3")[["x1", "x2"]].apply(target_calculation)
How can I calculate this without a for loop, ideally by vectorizing the whole process?
CodePudding user response:
You could use groupby + shift on "x1", subtract "x2" from the result, and divide by "x2":
df['target'] = (df.groupby('x3')['x1'].shift(-1) - df['x2']) / df['x2']
Output:
        Date   x1   x2     x3  target
0 2017-07-20   50   60  Kevin   -0.15
1 2017-07-21   51   80  Kevin     NaN
2 2016-05-23  100  200  Cathy   -0.90
3 2016-04-20   20   20  Cathy     NaN
4 2019-01-02   50   60    Leo     NaN
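
To check the result above, here is a minimal, self-contained sketch that rebuilds the five sample rows from the question and applies the same expression. One caveat is worth stating: shift(-1) takes the next row within each group, not the next calendar date. The sort_values step below is an assumption on my part, not something the answer does; it puts each group in date order before shifting. The sample rows for Cathy are not in date order, so the sorted result differs for that group.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2017-07-20", "2017-07-21", "2016-05-23", "2016-04-20", "2019-01-02"]
    ),
    "x1": [50, 51, 100, 20, 50],
    "x2": [60, 80, 200, 20, 60],
    "x3": ["Kevin", "Kevin", "Cathy", "Cathy", "Leo"],
})

# Same expression as in the answer: the next row's x1 within each x3 group,
# minus the current x2, divided by the current x2.
df["target"] = (df.groupby("x3")["x1"].shift(-1) - df["x2"]) / df["x2"]
print(df)

# Assumption (not in the original answer): if "next row" should mean
# "next date", sort each group chronologically before shifting.
df_sorted = df.drop(columns="target").sort_values(["x3", "Date"])
df_sorted["target"] = (
    df_sorted.groupby("x3")["x1"].shift(-1) - df_sorted["x2"]
) / df_sorted["x2"]
print(df_sorted)
```

Because the sorted frame keeps its original row labels, the result can be written back to the unsorted frame by index alignment if the original order matters.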
Note that
(df.groupby('x3')['x1'].shift(-1) / df['x2']) / df['x2']
produces output equivalent to flatten_list, but I think the first division in your loop is a typo rather than your desired output.