Home > Mobile >  Filtering a pandas dataframe to remove duplicates with a criterion
Filtering a pandas dataframe to remove duplicates with a criterion

Time:01-14

I am new to pandas dataframes, so I apologies in case there's an easy or even built-in way to do so.

Let's say I have a dataframe df with 3 columns A (a string), B (a float) and C (a bool). Values of column A are not unique. B is a random number and rows with same A value can have different values of B. Columns C is True if the value of A is repeated in the dataset.

An example

|   | A   | B   | C     |
|---|-----|-----|-------|
| 0 | cat | 10  | True  |
| 1 | dog | 10  | False |
| 2 | cat | 20  | True  |
| 3 | bee | 100 | False |

(The column C is actually redundant and could be obtained with df['C']=df['A'].duplicated(keep=False))

What I want to obtain is a dataframe were, for duplicated entries of A (C==True), only the row with the highest B value is kept.

I know how to get the list of rows with maximum value of B:

df.loc[df[df['C']].groupby('A')['B'].idxmax()]   #is this the best way actually?

but what I want is the opposite: filter df so to get only the entries not duplicated (C==False) and the duplicated ones with the highest B.

One possibility could be to concatenate df[~df['C']] and the previous table but is it the best way actually?

CodePudding user response:

One approach:

res = df.iloc[df.groupby("A")["B"].idxmax()]
print(res)

Output

     A    B      C
3  bee  100  False
2  cat   20   True
1  dog   10  False
  • Related