Remove rows from Dataframe with duplicate custom indexes and keep the row having max value for a col

Time:08-29

I have thousands of pandas DataFrames (df), each with 120 million rows, that look like this:

     a   b     c    d    e
111  10  100   200  300  james
112  11  110   210  310  june
112  11  140   210  312  dune
234  12  120   220  333  dane

The repeated custom index is 112. For index 112, I want to keep only the row with the maximum value in column 'b'. So the output would look like this:

     a   b     c    d    e
111  10  100   200  300  james
112  11  140   210  312  dune
234  12  120   220  333  dane

What would be a memory- and speed-efficient way to do this?

CodePudding user response:

You can try the following.

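import pandas as pd

# Rows whose index occurs more than once, sorted so the largest 'b' comes first
dupe_idx_rows = df[df.index.duplicated(keep=False)].sort_values(by='b', ascending=False)

# Keep only the first (largest 'b') row for each duplicated index
dupe_idx_rows_one = dupe_idx_rows[~dupe_idx_rows.index.duplicated(keep='first')]

# Combine with the already-unique rows (deduplicated rows come last; call .sort_index() if order matters)
out = pd.concat([df[~df.index.duplicated(keep=False)], dupe_idx_rows_one])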

Alternatively, you can group the rows by the index with groupby and pick the row with the largest 'b' in each group.
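
The answer does not spell out the groupby version, so the following is only a minimal sketch of one way it could look, not the answerer's own code. It rebuilds the sample frame from the question and assumes a temporary column name, idx (hypothetical, chosen here for illustration). The index is moved into a column first because idxmax returns labels, which are ambiguous while the index still has duplicates.

```python
import pandas as pd

# Sample frame from the question (index label 112 appears twice)
df = pd.DataFrame(
    {
        "a": [10, 11, 11, 12],
        "b": [100, 110, 140, 120],
        "c": [200, 210, 210, 220],
        "d": [300, 310, 312, 333],
        "e": ["james", "june", "dune", "dane"],
    },
    index=[111, 112, 112, 234],
)

# Move the custom index into a column named "idx" (hypothetical name) so
# every row gets a unique positional label 0..n-1.
flat = df.rename_axis("idx").reset_index()

# For each original index value, the row label holding the largest 'b'
best_rows = flat.groupby("idx")["b"].idxmax()

# Select those rows and restore the original, unnamed index
out_groupby = flat.loc[best_rows].set_index("idx").rename_axis(None)
print(out_groupby)
```

Note that reset_index copies the frame, so on 120 million rows this variant may cost more memory than the duplicated-based approach above; it is worth measuring on real data before choosing.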

CodePudding user response:

You can filter by the maximum value of a column in each group.
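
The code for this answer did not survive on the page, so what follows is an assumed reconstruction of the general idea rather than the original snippet: compute the per-group maximum of 'b' with groupby(...).transform("max") and keep the rows that match it. The sample frame is rebuilt from the question, and the variable names are illustrative.

```python
import pandas as pd

# Sample frame from the question (index label 112 appears twice)
df = pd.DataFrame(
    {
        "a": [10, 11, 11, 12],
        "b": [100, 110, 140, 120],
        "c": [200, 210, 210, 220],
        "d": [300, 310, 312, 333],
        "e": ["james", "june", "dune", "dane"],
    },
    index=[111, 112, 112, 234],
)

# Largest 'b' within each index value, broadcast back to every row
group_max = df.groupby(level=0)["b"].transform("max")

# Compare as NumPy arrays to sidestep label alignment on the duplicated index
mask = df["b"].to_numpy() == group_max.to_numpy()
out_filter = df[mask]

# If several rows tie for the maximum, keep only the first of them
out_filter = out_filter[~out_filter.index.duplicated(keep="first")]
print(out_filter)
```

This variant keeps the rows in their original order and avoids sorting, at the cost of one extra column-sized Series for the group maxima.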
