I have a dataset as a Pandas DataFrame, and I am trying to get the k
largest values within a rolling window of size L
.
Simplified problem:
import pandas as pd
X = pd.DataFrame([[((-1)**i) * i*10, ((-1)**i) * -i*5] for i in range(20)])
X.columns = ['A', 'B']
X.head()
###
A B
0 0 0
1 -10 5
2 20 -10
3 -30 15
4 40 -20
5 -50 25
6 60 -30
7 -70 35
8 80 -40
9 -90 45
10 100 -50
My goal is to get the k
largest values in the rolling window for each column. So if k=3
and the rolling window size is L=[4,7]
, the way I am currently doing this is
windows = [4, 7] # Length of each rolling window
k_largest = 3 # Find the k largest values within each rolling window
laggedVals = []
cols = []
for col in X.columns:
for k in range(k_largest):
for L in windows:
x = X.loc[:,col]
x_k_max = x.rolling(L).apply(lambda c: sorted(c, reverse=True)[k], raw=False)
laggedVals.append( x_k_max.values )
cols.append( f'W{L}_{k 1}_{col}' )
laggedVals = pd.DataFrame(np.stack(laggedVals, axis=1), index=X.index, columns=np.ravel(cols))
##
W4_1_A W7_1_A ...
0 NaN NaN ...
1 NaN NaN
2 NaN NaN
3 20.0 NaN
4 40.0 NaN
5 40.0 NaN
6 60.0 60.0
7 60.0 60.0
8 80.0 80.0
9 80.0 80.0 ...
so that there are 12 columns (the 3 largest values for window of size 4, 7; for each column).
However my dataset is very large and I'm looking for a more efficient way to do this as the code takes very long to run. Any suggestions?
CodePudding user response:
You can use the pandas built in function rolling (more info here). This takes in a dataframe and applies a roling window calculation based on a pandas built in, or own defined function (with apply). It only takes a single intiger as a window, or a window BaseIndexer subclass. I believe here you can specify multiple windows for multiple columns, but I find it easyer to loop over the colums.
X = pd.DataFrame([[((-1)**i) * i*10, ((-1)**i) * -i*5] for i in range(20)])
x = pd.DataFrame() #Emtpy dataframe, here roling window will be stored
windows = [4,7]
k = 3
for window, colname in zip(windows,X.columns):
x[colname] = X[colname].rolling(window).max()
print(x.nlargest(k,columns=x.columns)) #find max k values
result
19 180.0 95.0
18 180.0 85.0
17 160.0 85.0
16 160.0 75.0
0 NaN NaN
1 NaN NaN
2 NaN NaN
CodePudding user response:
hope this works for you,
- convert all columns to lists. 2) use heap for the window of size over the list 3) put it back to columns.