I have a pandas dataframe df like this:
ID | Weight | A |
---|---|---|
a | 0.15 | 1 |
a | 0.25 | 3 |
a | 0.02 | 2 |
a | 0.07 | 3 |
b | 0.01 | 1 |
b | 0.025 | 5 |
b | 0.07 | 7 |
b | 0.06 | 4 |
b | 0.12 | 2 |
I want to remove rows based on the ID column and Percentile of weight column such that, for df['ID'] = a, there are four rows. But if I want to keep at least 80% (it can vary) weight, I have to keep only rows with 0.15 and 0.25 weights (81.6%, whenever adding a weight crosses 80%, rest of the rows with the same 'ID' will be removed).
After the operation, df will become like this:
ID | Weight | A |
---|---|---|
a | 0.15 | 1 |
a | 0.25 | 3 |
b | 0.07 | 7 |
b | 0.06 | 4 |
b | 0.12 | 2 |
CodePudding user response:
Assume you need the highest weight per ID but only keep weights that just cross the 0.8 threshold:
(df.sort_values('Weight', ascending=False)
.groupby('ID', group_keys=False)
.apply(lambda g: g[(g.Weight.cumsum() / g.Weight.sum()).lt(0.8).shift(fill_value=True)]))
ID Weight A
1 a 0.25 3
0 a 0.15 1
8 b 0.12 2
6 b 0.07 7
7 b 0.06 4
We first sort Weight
in descending order, and then group by ID
and calculates cumulative weight percentage which we compare with the threshold. Note we shift the condition so that we keep the first value that crosses the threshold as indicated in the question.