Given a dataframe df
, how to calculate the rolling count of unique values through rows' direction subject to a boundary condition: window size = n?
Input data:
import pandas as pd
import numpy as np
data = {'col_0':[7, 8, 9, 10, 11, 12],
'col_1':[4, 5, 6, 7, 8, 9],
'col_2':[2, 5, 8, 11, 14, 15],
'col_3':[2, 6, 10, 14, 18, 21],
'col_4':[7, 5, 7, 5, 7, 5],
'col_5':[2, 6, 10, 14, 18, 21]}
df = pd.DataFrame(data)
print(df)
###
col_0 col_1 col_2 col_3 col_4 col_5
0 7 4 2 2 7 2
1 8 5 5 6 5 6
2 9 6 8 10 7 10
3 10 7 11 14 5 14
4 11 8 14 18 7 18
5 12 9 15 21 5 21
Expected output (with window size = 2):
print(df)
###
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
For the example above with window size = 2.
At
window 0
's array we have row[0].[[7 4 2 2 7 2]]
rolling_nunique[0]
is 3 with the elements being [2, 4, 7].At
window 1
's array we have row[0] & row[1].[[7 4 2 2 7 2] [8 5 5 6 5 6]]
rolling_nunique[1]
is 6 with the elements being [2, 4, 5, 6, 7, 8].At
window 2
's array we have row[1] & row[2].[[ 8 5 5 6 5 6] [ 9 6 8 10 7 10]]
rolling_nunique[2]
is 6 with the elements being [5, 6, 7, 8, 9, 10].etc.
CodePudding user response:
Using sliding_window_view
, you can customize how the values are aggregated in the sliding window. To get values for all rows before the sliding window is full (i.e., emulate min_periods=1
in pd.rolling
), we need to add some empty rows at the top. This can be done using vstack
and full
. At the end, we need to account for these added nan values by filtering them away.
from numpy.lib.stride_tricks import sliding_window_view
w = 2
values = np.vstack([np.full([w-1, df.shape[1]], np.nan), df.values])
m = sliding_window_view(values, w, axis=0).reshape(len(df), -1)
unique_count = [len(np.unique(r[~np.isnan(r)])) for r in m]
df['rolling_nunique'] = unique_count
Result:
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
CodePudding user response:
I found it could resolve by using sliding_window_view()
from numpy
,
Here's the approach:
rolling = 2
ar = df.values # turn into np.ndarray
length = ar.shape[1]
head_arrs = np.zeros((rolling-1, rolling*length))
cuboid = np.lib.stride_tricks.sliding_window_view(ar, (rolling,length)).astype(float)
plane = cuboid.reshape(-1,rolling*length)
for i in range(rolling-1,0,-1):
head_arr_l = plane[0,:i*length]
head_arr_l = np.pad(head_arr_l.astype(float), (0,length*(rolling-i)), 'constant', constant_values=np.nan)
head_arr_l = np.roll(head_arr_l, length*(rolling-i))
head_arrs[i-1,:] = head_arr_l
plane = np.insert(plane, 0, head_arrs, axis=0)
df['rolling_nunique'] = pd.DataFrame(plane).nunique(axis=1)
df
###
col_0 col_1 col_2 col_3 col_4 col_5 rolling_nunique
0 7 4 2 2 7 2 3
1 8 5 5 6 5 6 6
2 9 6 8 10 7 10 6
3 10 7 11 14 5 14 8
4 11 8 14 18 7 18 7
5 12 9 15 21 5 21 10
[reference] numpy.lib.stride_tricks.sliding_window_view