Home > OS >  Rolling count unique in dataframe's rows or ndarray
Rolling count unique in dataframe's rows or ndarray

Time:08-11

Given a dataframe df, how to calculate the rolling count of unique values through rows' direction subject to a boundary condition: window size = n?

Input data:

import pandas as pd
import numpy as np

data = {'col_0':[7, 8, 9, 10, 11, 12],
        'col_1':[4, 5, 6, 7, 8, 9],
        'col_2':[2, 5, 8, 11, 14, 15],
        'col_3':[2, 6, 10, 14, 18, 21],
        'col_4':[7, 5, 7, 5, 7, 5],
        'col_5':[2, 6, 10, 14, 18, 21]}
df = pd.DataFrame(data)
print(df)
###
   col_0  col_1  col_2  col_3  col_4  col_5
0      7      4      2      2      7      2
1      8      5      5      6      5      6
2      9      6      8     10      7     10
3     10      7     11     14      5     14
4     11      8     14     18      7     18
5     12      9     15     21      5     21

Expected output (with window size = 2):

print(df)
###
   col_0  col_1  col_2  col_3  col_4  col_5  rolling_nunique
0      7      4      2      2      7      2                3
1      8      5      5      6      5      6                6
2      9      6      8     10      7     10                6
3     10      7     11     14      5     14                8
4     11      8     14     18      7     18                7
5     12      9     15     21      5     21               10

For the example above with window size = 2.

  • At window 0's array we have row[0].

    [[7 4 2 2 7 2]]
    

    rolling_nunique[0] is 3 with the elements being [2, 4, 7].

  • At window 1's array we have row[0] & row[1].

    [[7 4 2 2 7 2]
    [8 5 5 6 5 6]]
    

    rolling_nunique[1] is 6 with the elements being [2, 4, 5, 6, 7, 8].

  • At window 2's array we have row[1] & row[2].

    [[ 8  5  5  6  5  6]
    [ 9  6  8 10  7 10]]
    

    rolling_nunique[2] is 6 with the elements being [5, 6, 7, 8, 9, 10].

  • etc.

CodePudding user response:

Using sliding_window_view, you can customize how the values are aggregated in the sliding window. To get values for all rows before the sliding window is full (i.e., emulate min_periods=1 in pd.rolling), we need to add some empty rows at the top. This can be done using vstack and full. At the end, we need to account for these added nan values by filtering them away.

from numpy.lib.stride_tricks import sliding_window_view

w = 2
values = np.vstack([np.full([w-1, df.shape[1]], np.nan), df.values])
m = sliding_window_view(values, w, axis=0).reshape(len(df), -1)
unique_count = [len(np.unique(r[~np.isnan(r)])) for r in m]
df['rolling_nunique'] = unique_count

Result:


   col_0  col_1  col_2  col_3  col_4  col_5  rolling_nunique
0      7      4      2      2      7      2                3
1      8      5      5      6      5      6                6
2      9      6      8     10      7     10                6
3     10      7     11     14      5     14                8
4     11      8     14     18      7     18                7
5     12      9     15     21      5     21               10

CodePudding user response:

I found it could resolve by using sliding_window_view() from numpy,
Here's the approach:

rolling = 2
ar = df.values # turn into np.ndarray
length = ar.shape[1]
head_arrs = np.zeros((rolling-1, rolling*length))
cuboid = np.lib.stride_tricks.sliding_window_view(ar, (rolling,length)).astype(float)
plane = cuboid.reshape(-1,rolling*length)

for i in range(rolling-1,0,-1):
        head_arr_l = plane[0,:i*length]
        head_arr_l = np.pad(head_arr_l.astype(float), (0,length*(rolling-i)), 'constant', constant_values=np.nan)
        head_arr_l = np.roll(head_arr_l, length*(rolling-i))
        head_arrs[i-1,:] = head_arr_l

plane = np.insert(plane, 0, head_arrs, axis=0)
df['rolling_nunique'] = pd.DataFrame(plane).nunique(axis=1)
df
###
   col_0  col_1  col_2  col_3  col_4  col_5  rolling_nunique
0      7      4      2      2      7      2                3
1      8      5      5      6      5      6                6
2      9      6      8     10      7     10                6
3     10      7     11     14      5     14                8
4     11      8     14     18      7     18                7
5     12      9     15     21      5     21               10

[reference] numpy.lib.stride_tricks.sliding_window_view

  • Related