I have a dataset with 30 million rows, representing 4000 different time series. I want to create 100 new columns containing the deltas to shifted values (the difference to the value 1 day before, 2 days before, and so on).
Data / MWE
import pandas as pd
df = pd.DataFrame({"dtime":["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05",
"2022-01-06", "2022-01-07", "2022-01-08", "2022-01-09", "2022-01-10"],
"value":[5,6,1,2,3,6,8,7,2,3]})
+----+------------+---------+
|    | dtime      |   value |
|----+------------+---------|
|  0 | 2022-01-01 |       5 |
|  1 | 2022-01-02 |       6 |
|  2 | 2022-01-03 |       1 |
|  3 | 2022-01-04 |       2 |
|  4 | 2022-01-05 |       3 |
|  5 | 2022-01-06 |       6 |
|  6 | 2022-01-07 |       8 |
|  7 | 2022-01-08 |       7 |
|  8 | 2022-01-09 |       2 |
|  9 | 2022-01-10 |       3 |
+----+------------+---------+
Solution (but inefficient)
I could do it column by column, but since the real-world data needs 100 (or maybe even more) columns, I don't want to do it like that. Knowing pandas, there must be a way to do it in one, two, or three lines :-)
df["1d"] = df["value"] - df["value"].shift(1)
df["2d"] = df["value"] - df["value"].shift(2)
df["3d"] = df["value"] - df["value"].shift(3)
df["4d"] = df["value"] - df["value"].shift(4)
Expected Output
+----+------------+---------+------+------+------+------+
|    | dtime      |   value |   1d |   2d |   3d |   4d |
|----+------------+---------+------+------+------+------|
|  0 | 01.01.2022 |       5 |  nan |  nan |  nan |  nan |
|  1 | 02.01.2022 |       6 |    1 |  nan |  nan |  nan |
|  2 | 03.01.2022 |       1 |   -5 |   -4 |  nan |  nan |
|  3 | 04.01.2022 |       2 |    1 |   -4 |   -3 |  nan |
|  4 | 05.01.2022 |       3 |    1 |    2 |   -3 |   -2 |
|  5 | 06.01.2022 |       6 |    3 |    4 |    5 |    0 |
|  6 | 07.01.2022 |       8 |    2 |    5 |    6 |    7 |
|  7 | 08.01.2022 |       7 |   -1 |    1 |    4 |    5 |
|  8 | 09.01.2022 |       2 |   -5 |   -6 |   -4 |   -1 |
|  9 | 10.01.2022 |       3 |    1 |   -4 |   -5 |   -3 |
+----+------------+---------+------+------+------+------+
CodePudding user response:
I think a simple for loop would suffice:
for i in range(1, 5):
    df[f'{i}d'] = df['value'].diff(i)
print(df)
dtime value 1d 2d 3d 4d
0 2022-01-01 5 NaN NaN NaN NaN
1 2022-01-02 6 1.0 NaN NaN NaN
2 2022-01-03 1 -5.0 -4.0 NaN NaN
3 2022-01-04 2 1.0 -4.0 -3.0 NaN
4 2022-01-05 3 1.0 2.0 -3.0 -2.0
5 2022-01-06 6 3.0 4.0 5.0 0.0
6 2022-01-07 8 2.0 5.0 6.0 7.0
7 2022-01-08 7 -1.0 1.0 4.0 5.0
8 2022-01-09 2 -5.0 -6.0 -4.0 -1.0
9 2022-01-10 3 1.0 -4.0 -5.0 -3.0
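The question mentions 4000 different time series in one frame, so the differences probably should not cross from one series into the next. Here is a minimal sketch of the same loop idea applied per group, assuming a hypothetical `series_id` column that identifies each series (that column is not in the original example). It also builds all delta columns first and joins them once, which avoids inserting 100 columns one at a time; pandas may flag repeated insertion with a fragmentation `PerformanceWarning`.

```python
import pandas as pd

# Toy data: the question's values split into two series.
# "series_id" is a hypothetical key column, assumed for illustration only.
df = pd.DataFrame({
    "series_id": ["a"] * 5 + ["b"] * 5,
    "value": [5, 6, 1, 2, 3, 6, 8, 7, 2, 3],
})

N = 4  # number of delta columns (100 in the real data)

# diff(i) is computed within each group, so the first i rows
# of every series are NaN instead of borrowing from the previous series.
grouped = df.groupby("series_id")["value"]
deltas = pd.concat(
    {f"{i}d": grouped.diff(i) for i in range(1, N + 1)},
    axis=1,
)

# One join instead of N separate column assignments.
out = df.join(deltas)
print(out)
```

If the frame holds only a single series, replacing `grouped.diff(i)` with `df["value"].diff(i)` should give the same columns as the loop above.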
CodePudding user response:
If you care about efficiency, you can use numpy.
Here is one approach:
import numpy as np

N = 4
# convert the column to array
a = df['value'].to_numpy(dtype='float')
# craft an indexing ndarray
b = np.tile(np.arange(len(a)), (N,1)).T-np.arange(1,N+1)
# slice and compute the difference
c = a[:,None]-a[b]
# mask the upper triangle
c[b<0] = np.nan
# assign back to DataFrame
df2 = df.join(pd.DataFrame(c))
output:
dtime value 0 1 2 3
0 2022-01-01 5 NaN NaN NaN NaN
1 2022-01-02 6 1.0 NaN NaN NaN
2 2022-01-03 1 -5.0 -4.0 NaN NaN
3 2022-01-04 2 1.0 -4.0 -3.0 NaN
4 2022-01-05 3 1.0 2.0 -3.0 -2.0
5 2022-01-06 6 3.0 4.0 5.0 0.0
6 2022-01-07 8 2.0 5.0 6.0 7.0
7 2022-01-08 7 -1.0 1.0 4.0 5.0
8 2022-01-09 2 -5.0 -6.0 -4.0 -1.0
9 2022-01-10 3 1.0 -4.0 -5.0 -3.0
NB. This runs 5-6 times faster than the loop on the provided dataset.
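As a sanity check, the numpy result can be compared against `Series.diff`. The sketch below is a variant of the approach above, not a replacement for it: it builds the index array with broadcasting instead of `np.tile`, names the columns `1d` to `4d` to match the expected output, and passes `index=df.index` so the join still aligns if the frame does not have a default index (an assumption about the real data).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [5, 6, 1, 2, 3, 6, 8, 7, 2, 3]})
N = 4

a = df["value"].to_numpy(dtype="float")

# idx[r, k] is the position of the value (k + 1) rows before row r;
# it is negative where no such earlier row exists.
idx = np.arange(len(a))[:, None] - np.arange(1, N + 1)

# Negative positions wrap around in numpy indexing, so mask them afterwards.
c = a[:, None] - a[idx]
c[idx < 0] = np.nan

names = [f"{i}d" for i in range(1, N + 1)]
deltas = pd.DataFrame(c, index=df.index, columns=names)

# Reference result from the plain pandas loop.
expected = pd.concat(
    {f"{i}d": df["value"].diff(i) for i in range(1, N + 1)},
    axis=1,
)
pd.testing.assert_frame_equal(deltas, expected)

df2 = df.join(deltas)
print(df2)
```

Note that the intermediate arrays have shape `(len(a), N)`, so with 30 million rows and 100 columns this may need to be done in chunks or per series, depending on available memory.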