Creating 100 new columns with increasing shift at once

Time:03-20

I have a dataset with 30 million rows, representing 4000 different time series. Now I want to create 100 new columns containing shifted differences: the delta between each value and the value 1 day before, 2 days before, and so on.


Data / MWE

import pandas as pd

df = pd.DataFrame({"dtime":["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05",
                            "2022-01-06", "2022-01-07", "2022-01-08", "2022-01-09", "2022-01-10"],
                   "value":[5,6,1,2,3,6,8,7,2,3]})

+----+------------+---------+
|    | dtime      |   value |
|----+------------+---------|
|  0 | 2022-01-01 |       5 |
|  1 | 2022-01-02 |       6 |
|  2 | 2022-01-03 |       1 |
|  3 | 2022-01-04 |       2 |
|  4 | 2022-01-05 |       3 |
|  5 | 2022-01-06 |       6 |
|  6 | 2022-01-07 |       8 |
|  7 | 2022-01-08 |       7 |
|  8 | 2022-01-09 |       2 |
|  9 | 2022-01-10 |       3 |
+----+------------+---------+

Solution (but inefficient)

I could do it column by column, but since the real-world data needs 100 (and maybe even more) columns, I don't want to do it like that. Knowing pandas, there must be a way to do it in one, two or three lines :-)

df["1d"] = df["value"] - df["value"].shift(1)
df["2d"] = df["value"] - df["value"].shift(2)
df["3d"] = df["value"] - df["value"].shift(3)
df["4d"] = df["value"] - df["value"].shift(4)

Expected Output

+----+------------+---------+------+------+------+------+
|    | dtime      |   value |   1d |   2d |   3d |   4d |
|----+------------+---------+------+------+------+------|
|  0 | 01.01.2022 |       5 |  nan |  nan |  nan |  nan |
|  1 | 02.01.2022 |       6 |    1 |  nan |  nan |  nan |
|  2 | 03.01.2022 |       1 |   -5 |   -4 |  nan |  nan |
|  3 | 04.01.2022 |       2 |    1 |   -4 |   -3 |  nan |
|  4 | 05.01.2022 |       3 |    1 |    2 |   -3 |   -2 |
|  5 | 06.01.2022 |       6 |    3 |    4 |    5 |    0 |
|  6 | 07.01.2022 |       8 |    2 |    5 |    6 |    7 |
|  7 | 08.01.2022 |       7 |   -1 |    1 |    4 |    5 |
|  8 | 09.01.2022 |       2 |   -5 |   -6 |   -4 |   -1 |
|  9 | 10.01.2022 |       3 |    1 |   -4 |   -5 |   -3 |
+----+------------+---------+------+------+------+------+

CodePudding user response:

I think a simple for loop would suffice:

for i in range(1, 5):
    df[f'{i}d'] = df['value'].diff(i)

print(df)

        dtime  value   1d   2d   3d   4d
0  2022-01-01      5  NaN  NaN  NaN  NaN
1  2022-01-02      6  1.0  NaN  NaN  NaN
2  2022-01-03      1 -5.0 -4.0  NaN  NaN
3  2022-01-04      2  1.0 -4.0 -3.0  NaN
4  2022-01-05      3  1.0  2.0 -3.0 -2.0
5  2022-01-06      6  3.0  4.0  5.0  0.0
6  2022-01-07      8  2.0  5.0  6.0  7.0
7  2022-01-08      7 -1.0  1.0  4.0  5.0
8  2022-01-09      2 -5.0 -6.0 -4.0 -1.0
9  2022-01-10      3  1.0 -4.0 -5.0 -3.0
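
With 100 or more columns, inserting them one at a time can trigger pandas' PerformanceWarning about a fragmented DataFrame, and the question also mentions 4000 separate series, which a plain diff would mix together at the series boundaries. The following is a minimal sketch of one way to handle both, not a tested solution for the real data: it builds all delta columns first and attaches them with a single concat, and optionally diffs within groups. The helper name add_lagged_deltas and the column name series_id are assumptions made for illustration; the original post does not say how the series are identified.

```python
import pandas as pd


def add_lagged_deltas(df, value_col="value", n_lags=100, group_col=None):
    """Return a copy of df with columns '1d' .. '<n_lags>d' (hypothetical helper)."""
    # Diff within each series if a grouping column is given,
    # otherwise over the whole column.
    if group_col is not None:
        source = df.groupby(group_col)[value_col]
    else:
        source = df[value_col]
    # Build every delta column first, then attach them in one concat
    # instead of inserting 100 columns one by one.
    deltas = pd.concat(
        {f"{k}d": source.diff(k) for k in range(1, n_lags + 1)}, axis=1
    )
    return pd.concat([df, deltas], axis=1)


# The example data from the question.
df = pd.DataFrame({
    "dtime": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05",
              "2022-01-06", "2022-01-07", "2022-01-08", "2022-01-09", "2022-01-10"],
    "value": [5, 6, 1, 2, 3, 6, 8, 7, 2, 3],
})
out = add_lagged_deltas(df, n_lags=4)

# Two short series in one frame; "series_id" is an assumed column name.
demo = pd.DataFrame({
    "series_id": ["a"] * 3 + ["b"] * 3,
    "value": [5, 6, 1, 2, 3, 6],
})
grouped = add_lagged_deltas(demo, n_lags=2, group_col="series_id")
```

In the grouped case the first rows of each series stay NaN, so deltas never reach across from one series into the next. This assumes the rows of each series are already sorted by time.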

CodePudding user response:

If you care about efficiency, you can use numpy.

Here is one approach:

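import numpy as np

N = 4

# convert the column to a float array (floats are needed to hold NaN)
a = df['value'].to_numpy(dtype='float')
# craft an indexing ndarray: b[i, k-1] = i - k, the row index k steps back
b = np.tile(np.arange(len(a)), (N, 1)).T - np.arange(1, N + 1)
# index and compute the difference (negative indices wrap around here)
c = a[:, None] - a[b]
# mask the upper triangle, where no earlier row exists
c[b < 0] = np.nan
# assign back to the DataFrame
df2 = df.join(pd.DataFrame(c))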

output:

        dtime  value    0    1    2    3
0  2022-01-01      5  NaN  NaN  NaN  NaN
1  2022-01-02      6  1.0  NaN  NaN  NaN
2  2022-01-03      1 -5.0 -4.0  NaN  NaN
3  2022-01-04      2  1.0 -4.0 -3.0  NaN
4  2022-01-05      3  1.0  2.0 -3.0 -2.0
5  2022-01-06      6  3.0  4.0  5.0  0.0
6  2022-01-07      8  2.0  5.0  6.0  7.0
7  2022-01-08      7 -1.0  1.0  4.0  5.0
8  2022-01-09      2 -5.0 -6.0 -4.0 -1.0
9  2022-01-10      3  1.0 -4.0 -5.0 -3.0

NB. This runs 5-6 times faster than the loop on the provided 10-row example dataset.
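
At the scale in the question, memory matters as much as speed: a float64 result with 30 million rows and 100 columns takes about 30,000,000 × 100 × 8 bytes = 24 GB, and an integer index array of the same shape adds a comparable amount. As a hedged sketch of a variant that avoids the index array, the result can be preallocated with NaN and filled with slice differences, one lag per pass. The function name lagged_deltas and the "1d".."4d" column labels are choices made here for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd


def lagged_deltas(values, n_lags):
    """Return an array whose column k-1 holds values[i] - values[i-k] (hypothetical helper)."""
    a = np.asarray(values, dtype="float64")
    # Start from all-NaN so rows without an earlier value stay NaN.
    out = np.full((len(a), n_lags), np.nan)
    for k in range(1, n_lags + 1):
        # Rows k.. minus rows 0..len-k: the difference to the value k steps back.
        out[k:, k - 1] = a[k:] - a[:-k]
    return out


df = pd.DataFrame({"value": [5, 6, 1, 2, 3, 6, 8, 7, 2, 3]})
N = 4
c = lagged_deltas(df["value"], N)
# Label the columns and reuse df's index so the join aligns row by row.
df2 = df.join(
    pd.DataFrame(c, columns=[f"{k}d" for k in range(1, N + 1)], index=df.index)
)
```

The loop runs only N times, each pass being a single vectorized subtraction, so it should stay close to the pure-indexing version in speed. Like the code above, it treats the column as one series and does not handle the 4000 separate series.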
