I have a time-series dataframe df
with zeros and ones and I aim to compare the three most recent days df_recent
with the historic time-series of df
. In order to do that, I try to build pairwise differences of df_recent
and df
on each day with a rolling window and sum them up for each ID
column.
df:
ID1 ID2 ID3
Date
2022-01-01 0 1 0
2022-01-02 0 1 0
2022-01-03 1 1 1
2022-01-04 0 0 0
2022-01-05 1 1 1
2022-01-06 0 0 1
2022-01-07 1 1 0
2022-01-08 0 0 0
2022-01-09 1 0 1 <-- 1
2022-01-10 0 1 1 <-- 2
2022-01-11 1 1 0 <-- 3
2022-01-12 1 0 0
df_recent:
ID1 ID2 ID3
Date
2022-01-10 0 1 1 <-- 1
2022-01-11 1 1 0 <-- 2
2022-01-12 1 0 0 <-- 3
df_pairwise_difference:
ID1 ID2 ID3
Date
2022-01-03 1 1 2
2022-01-04 1 0 2
2022-01-05 2 2 1
2022-01-06 1 1 3
2022-01-07 2 2 1
2022-01-08 1 1 0
2022-01-09 2 1 2
2022-01-10 1 3 3
2022-01-11 2 2 1 <-- example pairwise difference
2022-01-12 0 0 0
I try to build the difference by comparing pairwise the values (for an interval of three days) from df_recent
with df
on each day. An example for the pairwiese difference on 2022-01-11
: 2 2 1:
- ID1: 2 = 1 (pairwise difference marked line 1) 1 (pairwise difference marked line 2) 0 (pairwise difference marked line 3)
- ID2: 2 = 1 (pairwise difference marked line 1) 0 (pairwise difference marked line 2) 1 (pairwise difference marked line 3)
- ID3: 1 = 0 (pairwise difference marked line 1) 1 (pairwise difference marked line 2) 0 (pairwise difference marked line 3)
I thought about a for loop, but wasn't able to come up with something useful.
Many thanks for your suggestion!
For reproducability:
import pandas as pd
df = pd.DataFrame({
'Date':['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12'],
'ID1':[0,0,1,0,1,0,1,0,1,0,1,1],
'ID2':[1,1,1,0,1,0,1,0,0,1,1,0],
'ID3':[0,0,1,0,1,1,0,0,1,1,0,0]})
df = df.set_index('Date')
df_recent = df.iloc[-3:]
df_difference = pd.DataFrame({
'Date':['2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12'],
'ID1':[1,1,2,1,2,1,2,1,2,0],
'ID2':[1,0,2,1,2,1,1,3,2,0],
'ID3':[2,2,1,3,1,0,2,3,1,0]})
df_difference = df_difference.set_index('Date')
CodePudding user response:
If you can install numba
, here is one solution with .rolling
using the 'table'
method that allows you to roll over all columns at once:
import numpy as np
recent_array = df_recent.to_numpy()
def pairwise(arr):
if len(arr) < len(recent_array):
return np.array([np.nan, np.nan, np.nan])
return np.sum(np.abs(arr - recent_array), axis=0)
df_pairwise_difference = df.rolling(len(df_recent), method='table').apply(pairwise, engine='numba', raw=True)
print(df_pairwise_difference.dropna())
ID1 ID2 ID3
Date
2022-01-03 1.0 1.0 2.0
2022-01-04 1.0 0.0 2.0
2022-01-05 2.0 2.0 1.0
2022-01-06 1.0 1.0 3.0
2022-01-07 2.0 2.0 1.0
2022-01-08 1.0 1.0 0.0
2022-01-09 2.0 1.0 2.0
2022-01-10 1.0 3.0 3.0
2022-01-11 2.0 2.0 1.0
2022-01-12 0.0 0.0 0.0
Alternatively, you can iterate the windows:
import numpy as np
recent_array = df_recent.to_numpy()
df_pairwise_difference = df.copy()
for w in df.rolling(len(recent_array)):
if len(w) < len(recent_array):
df_pairwise_difference.loc[w.index[-1], w.columns] = np.nan
else:
df_pairwise_difference.loc[w.index[-1], w.columns] = np.sum(np.abs(w.to_numpy() - recent_array), axis=0)
print(df_pairwise_difference.dropna())
ID1 ID2 ID3
Date
2022-01-03 1.0 1.0 2.0
2022-01-04 1.0 0.0 2.0
2022-01-05 2.0 2.0 1.0
2022-01-06 1.0 1.0 3.0
2022-01-07 2.0 2.0 1.0
2022-01-08 1.0 1.0 0.0
2022-01-09 2.0 1.0 2.0
2022-01-10 1.0 3.0 3.0
2022-01-11 2.0 2.0 1.0
2022-01-12 0.0 0.0 0.0
The numba
version is fast but has an initial overhead to load the library and translate the code. So, use it if your dataset is very large, otherwise the "classic" iterative version will still be faster because it has no initial overhead.