Home > OS >  Pandas nlargest values of a Rolling Window
Pandas nlargest values of a Rolling Window

Time:04-24

I have a dataset as a Pandas DataFrame, and I am trying to get the k largest values within a rolling window of size L.

Simplified problem:

import pandas as pd
X = pd.DataFrame([[((-1)**i) * i*10, ((-1)**i) * -i*5] for i in range(20)])
X.columns = ['A', 'B']
X.head()

###
    A   B
0   0   0
1   -10 5
2   20  -10
3   -30 15
4   40  -20
5   -50 25
6   60  -30
7   -70 35
8   80  -40
9   -90 45
10  100 -50

My goal is to get the k largest values in the rolling window for each column. So if k=3 and the rolling window size is L=[4,7], the way I am currently doing this is

windows = [4, 7]   # Length of each rolling window
k_largest = 3         # Find the k largest values within each rolling window
laggedVals = []
cols = []
for col in X.columns:
    for k in range(k_largest):
        for L in windows:
            x = X.loc[:,col]
            x_k_max = x.rolling(L).apply(lambda c: sorted(c, reverse=True)[k], raw=False)
            laggedVals.append( x_k_max.values )
            cols.append( f'W{L}_{k 1}_{col}' )
laggedVals = pd.DataFrame(np.stack(laggedVals, axis=1), index=X.index, columns=np.ravel(cols))

##
W4_1_A      W7_1_A ...
0   NaN     NaN    ...
1   NaN     NaN
2   NaN     NaN
3   20.0    NaN
4   40.0    NaN
5   40.0    NaN
6   60.0    60.0
7   60.0    60.0
8   80.0    80.0
9   80.0    80.0 ...

so that there are 12 columns (the 3 largest values for window of size 4, 7; for each column).

However my dataset is very large and I'm looking for a more efficient way to do this as the code takes very long to run. Any suggestions?

CodePudding user response:

You can use the pandas built in function rolling (more info here). This takes in a dataframe and applies a roling window calculation based on a pandas built in, or own defined function (with apply). It only takes a single intiger as a window, or a window BaseIndexer subclass. I believe here you can specify multiple windows for multiple columns, but I find it easyer to loop over the colums.

X = pd.DataFrame([[((-1)**i) * i*10, ((-1)**i) * -i*5] for i in range(20)])
x = pd.DataFrame() #Emtpy dataframe, here roling window will be stored
windows = [4,7]
k = 3
for window, colname in zip(windows,X.columns):
    x[colname] = X[colname].rolling(window).max()

print(x.nlargest(k,columns=x.columns)) #find max k values

result

19  180.0  95.0
18  180.0  85.0
17  160.0  85.0
16  160.0  75.0
0     NaN   NaN
1     NaN   NaN
2     NaN   NaN

CodePudding user response:

hope this works for you,

  1. convert all columns to lists. 2) use heap for the window of size over the list 3) put it back to columns.
  • Related