I am new to pandas dataframes, so I apologies in case there's an easy or even built-in way to do so.
Let's say I have a dataframe df
with 3 columns A
(a string), B
(a float) and C
(a bool). Values of column A
are not unique. B
is a random number and rows with same A
value can have different values of B
. Columns C
is True
if the value of A
is repeated in the dataset.
An example
| | A | B | C |
|---|-----|-----|-------|
| 0 | cat | 10 | True |
| 1 | dog | 10 | False |
| 2 | cat | 20 | True |
| 3 | bee | 100 | False |
(The column C
is actually redundant and could be obtained with df['C']=df['A'].duplicated(keep=False)
)
What I want to obtain is a dataframe were, for duplicated entries of A
(C==True
), only the row with the highest B
value is kept.
I know how to get the list of rows with maximum value of B
:
df.loc[df[df['C']].groupby('A')['B'].idxmax()] #is this the best way actually?
but what I want is the opposite: filter df
so to get only the entries not duplicated (C==False
) and the duplicated ones with the highest B.
One possibility could be to concatenate df[~df['C']]
and the previous table but is it the best way actually?
CodePudding user response:
One approach:
res = df.iloc[df.groupby("A")["B"].idxmax()]
print(res)
Output
A B C
3 bee 100 False
2 cat 20 True
1 dog 10 False