Create pairwise difference of rolling window of two dataframes-CodePudding

I have a time-series dataframe df with zeros and ones and I aim to compare the three most recent days df_recent with the historic time-series of df. In order to do that, I try to build pairwise differences of df_recent and df on each day with a rolling window and sum them up for each ID column.

df:
            ID1 ID2 ID3
Date            
2022-01-01  0   1   0
2022-01-02  0   1   0
2022-01-03  1   1   1
2022-01-04  0   0   0
2022-01-05  1   1   1
2022-01-06  0   0   1
2022-01-07  1   1   0
2022-01-08  0   0   0
2022-01-09  1   0   1 <-- 1
2022-01-10  0   1   1 <-- 2
2022-01-11  1   1   0 <-- 3
2022-01-12  1   0   0

df_recent:
            ID1 ID2 ID3
Date            
2022-01-10  0   1   1 <-- 1 
2022-01-11  1   1   0 <-- 2
2022-01-12  1   0   0 <-- 3

df_pairwise_difference:
            ID1 ID2 ID3
Date            
2022-01-03  1   1   2
2022-01-04  1   0   2
2022-01-05  2   2   1
2022-01-06  1   1   3
2022-01-07  2   2   1
2022-01-08  1   1   0
2022-01-09  2   1   2
2022-01-10  1   3   3
2022-01-11  2   2   1 <-- example pairwise difference
2022-01-12  0   0   0

I try to build the difference by comparing pairwise the values (for an interval of three days) from df_recent with df on each day. An example for the pairwiese difference on 2022-01-11: 2 2 1:

ID1: 2 = 1 (pairwise difference marked line 1) 1 (pairwise difference marked line 2) 0 (pairwise difference marked line 3)
ID2: 2 = 1 (pairwise difference marked line 1) 0 (pairwise difference marked line 2) 1 (pairwise difference marked line 3)
ID3: 1 = 0 (pairwise difference marked line 1) 1 (pairwise difference marked line 2) 0 (pairwise difference marked line 3)

I thought about a for loop, but wasn't able to come up with something useful.
Many thanks for your suggestion!

For reproducability:

import pandas as pd
df = pd.DataFrame({
    'Date':['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08', 
            '2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12'],
    'ID1':[0,0,1,0,1,0,1,0,1,0,1,1], 
    'ID2':[1,1,1,0,1,0,1,0,0,1,1,0],
    'ID3':[0,0,1,0,1,1,0,0,1,1,0,0]})
df = df.set_index('Date')

df_recent = df.iloc[-3:]

df_difference = pd.DataFrame({
    'Date':['2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08', 
            '2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12'],
    'ID1':[1,1,2,1,2,1,2,1,2,0], 
    'ID2':[1,0,2,1,2,1,1,3,2,0],
    'ID3':[2,2,1,3,1,0,2,3,1,0]})
df_difference = df_difference.set_index('Date')

CodePudding user response：

If you can install numba, here is one solution with .rolling using the 'table' method that allows you to roll over all columns at once:

import numpy as np
recent_array = df_recent.to_numpy()

def pairwise(arr):
    if len(arr) < len(recent_array):
        return np.array([np.nan, np.nan, np.nan])
    return np.sum(np.abs(arr - recent_array), axis=0)

df_pairwise_difference = df.rolling(len(df_recent), method='table').apply(pairwise, engine='numba', raw=True)
print(df_pairwise_difference.dropna())

            ID1  ID2  ID3
Date                     
2022-01-03  1.0  1.0  2.0
2022-01-04  1.0  0.0  2.0
2022-01-05  2.0  2.0  1.0
2022-01-06  1.0  1.0  3.0
2022-01-07  2.0  2.0  1.0
2022-01-08  1.0  1.0  0.0
2022-01-09  2.0  1.0  2.0
2022-01-10  1.0  3.0  3.0
2022-01-11  2.0  2.0  1.0
2022-01-12  0.0  0.0  0.0

Alternatively, you can iterate the windows:

import numpy as np
recent_array = df_recent.to_numpy()

df_pairwise_difference = df.copy()

for w in df.rolling(len(recent_array)):
    if len(w) < len(recent_array):
        df_pairwise_difference.loc[w.index[-1], w.columns] = np.nan
    else:
        df_pairwise_difference.loc[w.index[-1], w.columns] = np.sum(np.abs(w.to_numpy() - recent_array), axis=0)

print(df_pairwise_difference.dropna())

            ID1  ID2  ID3
Date                     
2022-01-03  1.0  1.0  2.0
2022-01-04  1.0  0.0  2.0
2022-01-05  2.0  2.0  1.0
2022-01-06  1.0  1.0  3.0
2022-01-07  2.0  2.0  1.0
2022-01-08  1.0  1.0  0.0
2022-01-09  2.0  1.0  2.0
2022-01-10  1.0  3.0  3.0
2022-01-11  2.0  2.0  1.0
2022-01-12  0.0  0.0  0.0

The numba version is fast but has an initial overhead to load the library and translate the code. So, use it if your dataset is very large, otherwise the "classic" iterative version will still be faster because it has no initial overhead.