For a given df, I would like to take the N largest values in each row and replace them with their sum, nlargest(N).sum(). Basically, find the columns holding the N largest values and replace each of them with that sum, but only if the value in the cell is >= nlargest(N).sum() / N.
The following df is a reduced example:
import pandas as pd

data = [[1, 0.9, 0, 0, 0, 0, 0], [2, 0.3, 0.3, 0.3, 0, 0, 0.1], [3, 0, 0, 0, 0.2, 0.4, 0], [4, 1, 0, 0, 0, 0, 0], [5, 0.1, 0.1, 0.4, 0.4, 0, 0]]
df = pd.DataFrame(data, columns=['ID', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6'])
df.set_index('ID', inplace=True)
df
df
I think I could do something like this:
#get the df.nlargest(N).sum()
N = 3  # number of largest values
df['nlargestsum'] = df.apply(lambda s: s.abs().nlargest(N).sum(), axis=1)
It produces this output:
data = [[1, 0.9, 0, 0, 0, 0, 0, 0.9], [2, 0.3, 0.3, 0.3, 0, 0, 0.1, 0.9], [3, 0, 0, 0, 0.2, 0.4, 0, 0.6], [4, 1, 0, 0, 0, 0, 0, 1], [5, 0.1, 0.1, 0.4, 0.4, 0, 0, 0.9]]
df = pd.DataFrame(data, columns=["ID", 'S1', "S2", 'S3', 'S4', 'S5', 'S6', 'nlargestsum'])
df.set_index('ID', inplace=True)
df
So far so good, but it is from here that I'm struggling with how to produce the following output:
data = [[1, 0.9, 0, 0, 0, 0, 0, 0.9], [2, 0.9, 0.9, 0.9, 0, 0, 0.1, 0.9], [3, 0, 0, 0, 0.6, 0.6, 0, 0.6], [4, 1, 0, 0, 0, 0, 0, 1], [5, 0.1, 0.1, 0.9, 0.9, 0, 0, 0.9]]
df = pd.DataFrame(data, columns=["ID", 'S1', "S2", 'S3', 'S4', 'S5', 'S6', 'nlargestsum'])
df.set_index('ID', inplace=True)
df
As you can see, for ID 2, columns S1:S3, the original values changed to the nlargestsum value because 0.3 >= 0.9/3. However, for ID 5, columns S1 & S2, the values remain the same (< 0.9/3), but the values in ID 5, S3 & S4 changed because 0.4 > 0.9/3. The same applies to ID 3: the values in S4 & S5 are changed to 0.6 each.
I would appreciate some insights and ideas!
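In plain words, what I'm after per row is something like this slow, non-vectorized sketch (`replace_top_n` is just an illustrative name):

```python
import pandas as pd

def replace_top_n(row, n=3):
    # sum of the n largest values in the row
    total = row.nlargest(n).sum()
    # keep values below total / n, replace the rest with the sum
    return row.where(row < total / n, total)

data = [[0.9, 0, 0, 0, 0, 0], [0.3, 0.3, 0.3, 0, 0, 0.1]]
df = pd.DataFrame(data, columns=['S1', 'S2', 'S3', 'S4', 'S5', 'S6'])
out = df.apply(replace_top_n, axis=1)
print(out)
```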
CodePudding user response:
replacing values
Using numpy:
import numpy as np

cols = df.filter(like='S').columns
a = df[cols].to_numpy()
# double argsort turns the descending sort order into per-row ranks
# (0 = largest); a single argsort(-a) yields sort positions, not ranks
mask = np.argsort(np.argsort(-a, kind='stable')) < N
a[mask] = np.repeat(a[mask].reshape(-1, N).sum(1), N)
df[cols] = a
output:
    S1   S2   S3   S4   S5   S6  nlargestsum
ID
1  0.9  0.9  0.9  0.0  0.0  0.0          0.9
2  0.9  0.9  0.9  0.0  0.0  0.1          0.9
3  0.6  0.0  0.0  0.6  0.6  0.0          0.6
4  1.0  1.0  1.0  0.0  0.0  0.0          1.0
5  0.9  0.1  0.9  0.9  0.0  0.0          0.9
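As a quick standalone sanity check of argsort-based rank masks (a double argsort converts the sort order into per-column ranks):

```python
import numpy as np

a = np.array([[0.1, 0.4, 0.2, 0.3]])
order = np.argsort(-a, kind='stable')  # column indices, largest value first
ranks = np.argsort(order)              # rank of each column (0 = largest)
print(order)      # [[1 3 2 0]]
print(ranks)      # [[3 0 2 1]]
print(ranks < 2)  # top-2 mask: [[False  True False  True]]
```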
older answer
With numpy:
N = 3
df['nlargestsum'] = (np.sort(df.to_numpy())[:, -N:]).sum(1)
# or, to limit to the "S" columns
# df['nlargestsum'] = (np.sort(df.filter(like='S').to_numpy())[:, -N:]).sum(1)
You can also use:
N = 3
df['nlargestsum'] = (df#.filter(like='S') # uncomment to restrict to the "S" columns
.stack().sort_values()
.groupby(level=0)
.apply(lambda x: x.iloc[-N:].sum())
)
output:
    S1   S2   S3   S4   S5   S6  nlargestsum
ID
1  0.9  0.0  0.0  0.0  0.0  0.0          0.9
2  0.3  0.3  0.3  0.0  0.0  0.1          0.9
3  0.0  0.0  0.0  0.2  0.4  0.0          0.6
4  1.0  0.0  0.0  0.0  0.0  0.0          1.0
5  0.1  0.1  0.4  0.4  0.0  0.0          0.9
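Both variants compute the same column; a small equivalence check on made-up data (a sketch, not the question's frame):

```python
import numpy as np
import pandas as pd

N = 3
df = pd.DataFrame([[0.9, 0, 0, 0], [0.3, 0.3, 0.1, 0.2]],
                  columns=['S1', 'S2', 'S3', 'S4'])
# row-wise sum of the N largest values, computed two ways
via_sort = np.sort(df.to_numpy())[:, -N:].sum(1)
via_stack = (df.stack().sort_values()
               .groupby(level=0)
               .apply(lambda x: x.iloc[-N:].sum()))
assert np.allclose(via_sort, via_stack.to_numpy())
```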
CodePudding user response:
Use:
# filter only the S columns
df1 = df.filter(like='S')
# where a value is >= nlargestsum / N, replace it with nlargestsum
df.update(df1.mask(df1.ge(df['nlargestsum'].div(N), axis=0), df['nlargestsum'], axis=0))
print(df)
    S1   S2   S3   S4   S5   S6  nlargestsum
ID
1  0.9  0.0  0.0  0.0  0.0  0.0          0.9
2  0.9  0.9  0.9  0.0  0.0  0.1          0.9
3  0.0  0.0  0.0  0.2  0.6  0.0          0.6
4  1.0  0.0  0.0  0.0  0.0  0.0          1.0
5  0.1  0.1  0.9  0.9  0.0  0.0          0.9
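One caveat about the output above: for ID 3 the 0.2 in S4 is not replaced, even though 0.2 >= 0.6/3 on paper. That is floating-point rounding in the computed threshold, which a small tolerance would address:

```python
total = 0.0 + 0.2 + 0.4   # nlargestsum for ID 3
print(total)              # 0.6000000000000001
print(0.2 >= total / 3)   # False: the threshold is slightly above 0.2
print(0.2 >= total / 3 - 1e-9)  # True with a small tolerance
```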