I have a large dataframe, I want to find the maximum of each row in a window size of 5. Here is a simple example of df with shape 3*15. For the first row, 10 is the maximum in the first 5 values, for the second window, the maximum is 20 and for the third window at row 1, the maximum is 101.
import pandas as pd
df = pd.DataFrame()
df['a'] = [1, 0, 1]
df['b'] = [10,12,4]
df['c'] = [10, 0, 6]
df['d'] = [10,102,1]
df['aa'] = [-1, 10, 21]
df['bb'] = [10,12,31]
df['ac'] = [-1, 0, 41]
df['bv'] = [20,121,11]
df['a1'] = [3, 10, 1]
df['bx'] = [10,12,51]
df['ad'] = [6, 10, 1]
df['bn'] = [101,12,31]
df['aq'] = [2, 0, 1]
df['bp'] = [1,112,12]
df['ak'] = [23, 0, 1]
a b c d aa bb ac bv a1 bx ad bn aq bp ak
0 1 10 10 10 -1 10 -1 20 3 10 6 101 2 1 23
1 0 12 0 102 10 12 0 121 10 12 10 12 0 112 0
2 1 4 6 1 21 31 41 11 1 51 1 31 1 12 1
Output:
v1 v2 v3
0 10 20 101
1 102 121 112
2 21 51 31
Could you please help me with that?
CodePudding user response:
I would use numpy to reshape it to be 5 wide than use numpy max, and finally reshape it back to be 3 wide
print(df.values.reshape((-1,5)).max(axis=1).reshape(-1,3))
[[ 10 20 101]
[102 121 112]
[ 21 51 31]]
CodePudding user response:
You can first split the dataframe column into 3 parts
split = [df.columns[i*5:(i 1)*5] for i in range(3)]
Then, with the help of df.loc
, you can get the desired dataframe.
pd.DataFrame([df.loc[:, i].max(axis=1) for i in split], index = ['v1','v2','v3'])
0 1 2
v1 10 102 21
v2 20 121 51
v3 101 112 31
CodePudding user response:
If you are looking to generalise it for any number of columns or window size:
from tqdm import tqdm
tqdm.pandas()
w = 5 # window size
df.progress_apply(lambda row: pd.Series([max(sub) for sub in arr.reshape(int(len(row)/w), w)], index=["v" str(i) for i in range(1, int(len(row)/w) 1)]), axis=1)
>> v1 v2 v3
>> 0 10 20 101
>> 1 10 20 101
>> 2 10 20 101
Plus you mentioned that you've a large dataframe - progress_apply
will help to track progress, if the dataset is insanely large.