For a given df, I would like to take the N largest values in each row and replace them with their sum, nlargest(N).sum(). Basically, find the columns holding the N largest values and replace each of them with that sum, but only if the value in the cell is >= nlargest(N).sum() / N.
The following df is a reduced example:
import pandas as pd

data = [[1, 0.9, 0, 0, 0, 0, 0], [2, 0.3, 0.3, 0.3, 0, 0, 0.1], [3, 0, 0, 0, 0.2, 0.4, 0], [4, 1, 0, 0, 0, 0, 0], [5, 0.1, 0.1, 0.4, 0.4, 0, 0]]
df = pd.DataFrame(data, columns=['ID', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6'])
df.set_index('ID', inplace=True)
df
df
I think I could do something like this:
#get the df.nlargest(N).sum()
N = 3  # number of largest values
df['nlargestsum'] = df.apply(lambda s: s.abs().nlargest(N).sum(), axis=1)
It produces this output:
data = [[1, 0.9, 0, 0, 0, 0, 0, 0.9], [2, 0.3, 0.3, 0.3, 0, 0, 0.1, 0.9], [3, 0, 0, 0, 0.2, 0.4, 0, 0.6], [4, 1, 0, 0, 0, 0, 0, 1], [5, 0.1, 0.1, 0.4, 0.4, 0, 0, 0.9]]
df = pd.DataFrame(data, columns=["ID", 'S1', "S2", 'S3', 'S4', 'S5', 'S6', 'nlargestsum'])
df.set_index('ID', inplace=True)
df
So far so good, but it is from here that I'm struggling with how to produce the following output:
data = [[1, 0.9, 0, 0, 0, 0, 0, 0.9], [2, 0.9, 0.9, 0.9, 0, 0, 0.1, 0.9], [3, 0, 0, 0, 0.6, 0.6, 0, 0.6], [4, 1, 0, 0, 0, 0, 0, 1], [5, 0.1, 0.1, 0.9, 0.9, 0, 0, 0.9]]
df = pd.DataFrame(data, columns=["ID", 'S1', "S2", 'S3', 'S4', 'S5', 'S6', 'nlargestsum'])
df.set_index('ID', inplace=True)
df
As you can see, for ID 2, columns S1:S3, the original values changed to the nlargestsum value because 0.3 >= 0.9/3. However, for ID 5, columns S1 & S2, the values remain the same (< 0.9/3), but the values in ID 5, S3 & S4 changed because 0.4 > 0.9/3. The same applies to ID 3: the values in S4 & S5 are changed to 0.6 each.
I would appreciate some insights and ideas!
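In plain words, what I'm after per row is something like this slow, non-vectorized sketch (`replace_top_n` is just an illustrative name):

```python
import pandas as pd

def replace_top_n(row, n=3):
    # sum of the n largest values in the row
    total = row.nlargest(n).sum()
    # keep values below total / n, replace the rest with the sum
    return row.where(row < total / n, total)

data = [[0.9, 0, 0, 0, 0, 0], [0.3, 0.3, 0.3, 0, 0, 0.1]]
df = pd.DataFrame(data, columns=['S1', 'S2', 'S3', 'S4', 'S5', 'S6'])
out = df.apply(replace_top_n, axis=1)
print(out)
```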
CodePudding user response:
replacing values
Using numpy:
import numpy as np

cols = df.filter(like='S').columns
a = df[cols].to_numpy()
# double argsort turns the descending sort order into per-row ranks
# (0 = largest); a single argsort(-a) yields sort positions, not ranks
mask = np.argsort(np.argsort(-a, kind='stable')) < N
a[mask] = np.repeat(a[mask].reshape(-1, N).sum(1), N)
df[cols] = a
output:
    S1   S2   S3   S4   S5   S6  nlargestsum
ID
1  0.9  0.9  0.9  0.0  0.0  0.0          0.9
2  0.9  0.9  0.9  0.0  0.0  0.1          0.9
3  0.6  0.0  0.0  0.6  0.6  0.0          0.6
4  1.0  1.0  1.0  0.0  0.0  0.0          1.0
5  0.9  0.1  0.9  0.9  0.0  0.0          0.9
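As a quick standalone sanity check of argsort-based rank masks (a double argsort converts the sort order into per-column ranks):

```python
import numpy as np

a = np.array([[0.1, 0.4, 0.2, 0.3]])
order = np.argsort(-a, kind='stable')  # column indices, largest value first
ranks = np.argsort(order)              # rank of each column (0 = largest)
print(order)      # [[1 3 2 0]]
print(ranks)      # [[3 0 2 1]]
print(ranks < 2)  # top-2 mask: [[False  True False  True]]
```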
older answer
With numpy:
N = 3
df['nlargestsum'] = (np.sort(df.to_numpy())[:, -N:]).sum(1)
# or, to limit to the "S" columns
# df['nlargestsum'] = (np.sort(df.filter(like='S').to_numpy())[:, -N:]).sum(1)
You can also use:
N = 3
df['nlargestsum'] = (df#.filter(like='S') # uncomment to restrict to the "S" columns
.stack().sort_values()
.groupby(level=0)
.apply(lambda x: x.iloc[-N:].sum())
)
output:
    S1   S2   S3   S4   S5   S6  nlargestsum
ID
1  0.9  0.0  0.0  0.0  0.0  0.0          0.9
2  0.3  0.3  0.3  0.0  0.0  0.1          0.9
3  0.0  0.0  0.0  0.2  0.4  0.0          0.6
4  1.0  0.0  0.0  0.0  0.0  0.0          1.0
5  0.1  0.1  0.4  0.4  0.0  0.0          0.9
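Both variants compute the same column; a small equivalence check on made-up data (a sketch, not the question's frame):

```python
import numpy as np
import pandas as pd

N = 3
df = pd.DataFrame([[0.9, 0, 0, 0], [0.3, 0.3, 0.1, 0.2]],
                  columns=['S1', 'S2', 'S3', 'S4'])
# row-wise sum of the N largest values, computed two ways
via_sort = np.sort(df.to_numpy())[:, -N:].sum(1)
via_stack = (df.stack().sort_values()
               .groupby(level=0)
               .apply(lambda x: x.iloc[-N:].sum()))
assert np.allclose(via_sort, via_stack.to_numpy())
```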
CodePudding user response:
Use:
# filter only the S columns
df1 = df.filter(like='S')
# where a value is >= nlargestsum / N, replace it with nlargestsum
df.update(df1.mask(df1.ge(df['nlargestsum'].div(N), axis=0), df['nlargestsum'], axis=0))
print(df)
    S1   S2   S3   S4   S5   S6  nlargestsum
ID
1  0.9  0.0  0.0  0.0  0.0  0.0          0.9
2  0.9  0.9  0.9  0.0  0.0  0.1          0.9
3  0.0  0.0  0.0  0.2  0.6  0.0          0.6
4  1.0  0.0  0.0  0.0  0.0  0.0          1.0
5  0.1  0.1  0.9  0.9  0.0  0.0          0.9
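One caveat about the output above: for ID 3 the 0.2 in S4 is not replaced, even though 0.2 >= 0.6/3 on paper. That is floating-point rounding in the computed threshold, which a small tolerance would address:

```python
total = 0.0 + 0.2 + 0.4   # nlargestsum for ID 3
print(total)              # 0.6000000000000001
print(0.2 >= total / 3)   # False: the threshold is slightly above 0.2
print(0.2 >= total / 3 - 1e-9)  # True with a small tolerance
```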