Removing all non-unique rows from a dataframe


Sorry, this is my second post - please let me know if something doesn't make sense!

I'm trying to remove all rows that have any duplicates. I've tried the keep=False parameter of drop_duplicates(), but it's not doing what I want.

Let's say my DataFrame looks something like this:

ORDER ID | ITEM CODE
123      | XXX
123      | YYY
123      | YYY
456      | XXX
456      | XXX
456      | XXX
789      | XXX
000      | YYY

I want it to look like this:

ORDER ID | ITEM CODE
123      | XXX
789      | XXX
000      | YYY

CodePudding user response:

I suggest iterating over the rows with a loop and, for each row, using an if statement to compare it to the previous one: if they match, exclude it; if they don't, keep it.
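A rough sketch of that idea in plain Python (hypothetical code, not from the original answer): after sorting, duplicate rows sit next to each other, so a row is unique exactly when it differs from both its neighbours. Note that comparing only to the previous row would still keep the first copy of each duplicate group.

```python
# Sketch: drop every row that appears more than once.
# Sorting puts duplicates next to each other, so a row is unique
# exactly when it differs from both the previous and the next row.
rows = [
    (123, "XXX"), (123, "YYY"), (123, "YYY"),
    (456, "XXX"), (456, "XXX"), (456, "XXX"),
    (789, "XXX"), (0, "YYY"),
]

rows_sorted = sorted(rows)
unique_rows = [
    row for i, row in enumerate(rows_sorted)
    if (i == 0 or rows_sorted[i - 1] != row)
    and (i == len(rows_sorted) - 1 or rows_sorted[i + 1] != row)
]
print(unique_rows)
```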

CodePudding user response:

Try using

df = df.drop_duplicates(subset='ORDER ID')
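A quick check of what this does on the sample data (the DataFrame construction below is my own sketch): it keeps the first row of each ORDER ID rather than removing every copy of a duplicated row, so it may not match the output asked for.

```python
import pandas as pd

df = pd.DataFrame(
    {"ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
     "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"]}
)

# keeps the first row seen for each ORDER ID
result = df.drop_duplicates(subset="ORDER ID")
print(result)
```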

CodePudding user response:

I managed to combine two other answers:

  1. Find the rows to drop: https://stackoverflow.com/a/64105947/2681662
  2. Use that DataFrame to drop them: https://stackoverflow.com/a/44706892/2681662

Find lines to drop:

import pandas as pd

lst = [
    [123, "XXX"],
    [123, "YYY"],
    [123, "YYY"],
    [456, "XXX"],
    [456, "XXX"],
    [456, "XXX"],
    [789, "XXX"],
    [000, "YYY"],
]

df = pd.DataFrame(lst, columns=["ORDER ID", "ITEM CODE"])

# mark every row that repeats an earlier (ORDER ID, ITEM CODE) pair
to_drop = df[pd.DataFrame(df.sort_values(by=["ORDER ID", "ITEM CODE"]), index=df.index).duplicated()]

Drop all rows listed in to_drop.

So the whole code would look like:

import pandas as pd

lst = [
    [123, "XXX"],
    [123, "YYY"],
    [123, "YYY"],
    [456, "XXX"],
    [456, "XXX"],
    [456, "XXX"],
    [789, "XXX"],
    [000, "YYY"],
]

df = pd.DataFrame(lst, columns=["ORDER ID", "ITEM CODE"])

# mark every row that repeats an earlier (ORDER ID, ITEM CODE) pair
to_drop = df[pd.DataFrame(df.sort_values(by=["ORDER ID", "ITEM CODE"]), index=df.index).duplicated()]

print(pd.merge(df,to_drop, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))
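For what it's worth, the merge with indicator=True is doing an anti-join: the extra _merge column labels each row 'left_only', 'right_only', or 'both', and the query then keeps only the rows of df that have no match in to_drop. A minimal illustration of the mechanism (toy data, my own sketch):

```python
import pandas as pd

left = pd.DataFrame({"k": [1, 2, 3]})
right = pd.DataFrame({"k": [2, 3]})

# indicator=True adds a _merge column telling where each row came from
merged = pd.merge(left, right, indicator=True, how="outer")

# anti-join: keep only rows present in `left` but not in `right`
anti = merged.query('_merge == "left_only"').drop("_merge", axis=1)
print(anti)
```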

CodePudding user response:

Let's define your sample DataFrame:

import pandas as pd

data = {"ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0], "ITEM CODE": ['XXX', 'YYY', 'YYY', 'XXX', 'XXX', 'XXX', 'XXX', 'YYY']}

df = pd.DataFrame(data)

 ORDER ID ITEM CODE
  123       XXX
  123       YYY
  123       YYY
  456       XXX
  456       XXX
  456       XXX
  789       XXX
    0       YYY

You can remove duplicates based on all columns or only the ones you want; the subset parameter accepts a single column name or a list of column names.

new_df = df.drop_duplicates(subset='ORDER ID')

 ORDER ID ITEM CODE
  123       XXX
  456       XXX
  789       XXX
    0       YYY
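As a side note (my own addition, not part of the answer above): passing both columns as the subset list and adding keep=False removes every copy of a duplicated row, which matches the output asked for in the question.

```python
import pandas as pd

data = {"ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
        "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"]}
df = pd.DataFrame(data)

# subset takes a list of columns; keep=False drops every member of
# a duplicated group instead of keeping its first occurrence
new_df = df.drop_duplicates(subset=["ORDER ID", "ITEM CODE"], keep=False)
print(new_df)
```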