Pandas Time series manipulation with large panel data

Time:03-11

Here is my large panel dataset:

Date x1 x2 x3
2017-07-20 50 60 Kevin
2017-07-21 51 80 Kevin
2016-05-23 100 200 Cathy
2016-04-20 20 20 Cathy
2019-01-02 50 60 Leo

This dataset contains billions of rows. I would like to calculate the 1-day percentage difference between x1 and x2. Let t and t+1 denote today and tomorrow; I want to compute (x1_{t+1} - x2_t) / x2_t.

First I tried the approach that was quickest to write:

I created a nested list containing the target values for each group of x3:

nested_list = []
flatten_list = []

for group in df.x3.unique():
    df_ = df[df.x3 == group]
    nested_list.append((df_.x1.shift(-1) / df_.x2) / df_.x2)
for lst in nested_list:
    for i in lst:
        flatten_list.append(i)

df["target"] = flatten_list

However, this method would literally take a year to run, which is not practical.

I also tried the native pandas groupby method, hoping it would run in reasonable time, but it did not seem to work:

def target_calculation(x):
    target = (x.x1.shift(-1) - x.x2) / x.x2
    return target

df["target"] = df.groupby("x3")[["x1", "x2"]].apply(target_calculation)

How can I calculate this without a for loop, or possibly vectorize the whole process?

CodePudding user response:

You could use groupby to shift "x1" within each group and subtract "x2" from it:

df['target'] = (df.groupby('x3')['x1'].shift(-1) - df['x2']) / df['x2']

Output:

         Date   x1   x2     x3  target
0  2017-07-20   50   60  Kevin   -0.15
1  2017-07-21   51   80  Kevin     NaN
2  2016-05-23  100  200  Cathy   -0.90
3  2016-04-20   20   20  Cathy     NaN
4  2019-01-02   50   60    Leo     NaN
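
To reproduce the output above, here is a minimal self-contained sketch. It rebuilds the five sample rows from the question as a DataFrame (the construction is only an assumption about how the data is loaded) and applies the same grouped shift. The intermediate name next_x1 is introduced here for readability and is not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question; real data would be loaded from disk.
df = pd.DataFrame({
    "Date": ["2017-07-20", "2017-07-21", "2016-05-23", "2016-04-20", "2019-01-02"],
    "x1": [50, 51, 100, 20, 50],
    "x2": [60, 80, 200, 20, 60],
    "x3": ["Kevin", "Kevin", "Cathy", "Cathy", "Leo"],
})

# x1 of the next row within each x3 group, aligned to the original index.
next_x1 = df.groupby("x3")["x1"].shift(-1)

# (x1_{t+1} - x2_t) / x2_t; the last row of each group has no successor, so it is NaN.
df["target"] = (next_x1 - df["x2"]) / df["x2"]
print(df)
```

Because groupby(...).shift returns a Series on the original index, the arithmetic runs as ordinary column operations with no Python-level loop over groups.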

Note that

(df.groupby('x3')['x1'].shift(-1) / df['x2']) / df['x2']

produces output equivalent to flatten_list, but I suspect that is a typo in your loop rather than the output you actually want.
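
One caveat: shift(-1) picks the next row within the group, not the next calendar day. In the sample data, Cathy's later date (2016-05-23) is listed before the earlier one (2016-04-20), so "next row" there is actually an earlier observation. If the rows are meant to be chronological, sorting first may be needed. The sketch below assumes the Date column can be parsed by pd.to_datetime and that the original row order should be restored afterwards; both are assumptions, not something the question states.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Date": ["2017-07-20", "2017-07-21", "2016-05-23", "2016-04-20", "2019-01-02"],
    "x1": [50, 51, 100, 20, 50],
    "x2": [60, 80, 200, 20, 60],
    "x3": ["Kevin", "Kevin", "Cathy", "Cathy", "Leo"],
})

# Parse dates so that sorting is chronological rather than lexical.
df["Date"] = pd.to_datetime(df["Date"])

# Order rows by group and then by date, so shift(-1) means "next observation in time".
df = df.sort_values(["x3", "Date"])

df["target"] = (df.groupby("x3")["x1"].shift(-1) - df["x2"]) / df["x2"]

# Restore the original row order (assumes the original index was the default 0..n-1).
df = df.sort_index()
print(df)
```

Note that this still treats the next observation as t+1 even when the dates are more than one day apart, as with Cathy's two rows; handling gaps would need an explicit check on the date difference.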
