Sorry, this is my second post - please let me know if something doesn't make sense!
I'm trying to remove all rows that have any duplicates. I've tried the keep=False parameter for drop_duplicates(), and it's just not doing the right thing.
Let's say my dataframe looks something like this:
ORDER ID | ITEM CODE
123      | XXX
123      | YYY
123      | YYY
456      | XXX
456      | XXX
456      | XXX
789      | XXX
000      | YYY
I want it to look like this:
ORDER ID | ITEM CODE
123      | XXX
789      | XXX
000      | YYY
CodePudding user response:
I suggest using a loop to iterate through every row. While iterating, use an if statement to compare the current row to the previous one: if they are the same, exclude the row; if they aren't, keep it.
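A minimal sketch of that loop, assuming the sample data from the question (the names kept_rows, previous and loop_result are made up for illustration). Note that comparing each row only to the previous one keeps the first row of every consecutive run, so it does not remove all copies of a duplicated row:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
        "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"],
    }
)

kept_rows = []
previous = None
for row in df.itertuples(index=False, name=None):
    # Keep the row only if it differs from the one immediately before it.
    if row != previous:
        kept_rows.append(row)
    previous = row

loop_result = pd.DataFrame(kept_rows, columns=df.columns)
print(loop_result)
```

On this data the loop keeps five rows (one 123/YYY and one 456/XXX survive), which is not quite the output asked for.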
CodePudding user response:
Try using
df = df.drop_duplicates(subset='ORDER ID')
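For comparison, here is a small sketch of keep=False applied to whole rows (no subset) on the sample data from the question. drop_duplicates() returns a new DataFrame rather than changing df in place, so the result has to be assigned to something; whether that was the problem in the original attempt is only a guess. The name unique_rows is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
        "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"],
    }
)

# keep=False drops every copy of a repeated row, including the first one.
# The original df is left untouched; the filtered rows are in unique_rows.
unique_rows = df.drop_duplicates(keep=False)
print(unique_rows)
```

On this data the rows left are 123/XXX, 789/XXX and 0/YYY, with their original index labels.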
CodePudding user response:
I managed to compile the answer from two other answers:
- First, find the rows to drop. https://stackoverflow.com/a/64105947/2681662
- Then use that dataframe to drop them from the original. https://stackoverflow.com/a/44706892/2681662
Find lines to drop:
import pandas as pd
lst = [
[123, "XXX"],
[123, "YYY"],
[123, "YYY"],
[456, "XXX"],
[456, "XXX"],
[456, "XXX"],
[789, "XXX"],
[000, "YYY"],
]
df = pd.DataFrame(lst, columns=["ORDER ID", "ITEM CODE"])
# duplicated() marks the second and later occurrences of each repeated row.
to_drop = df[pd.DataFrame(df.sort_values(by=["ORDER ID", "ITEM CODE"]), index=df.index).duplicated()]
Then drop every row of df that matches a row in to_drop.
So the whole code would look like:
import pandas as pd
lst = [
[123, "XXX"],
[123, "YYY"],
[123, "YYY"],
[456, "XXX"],
[456, "XXX"],
[456, "XXX"],
[789, "XXX"],
[000, "YYY"],
]
df = pd.DataFrame(lst, columns=["ORDER ID", "ITEM CODE"])
to_drop = df[pd.DataFrame(df.sort_values(by=["ORDER ID", "ITEM CODE"]), index=df.index).duplicated()]
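# Outer-merge on both columns and keep only the rows with no match in to_drop,
# which removes every copy of a repeated row, including the first one.
print(pd.merge(df, to_drop, indicator=True, how='outer')
.query('_merge=="left_only"')
.drop('_merge', axis=1))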
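As a possibly shorter route to the same rows, here is a sketch that uses duplicated(keep=False) as a boolean mask instead of the merge. It assumes the same sample data; has_copy and result are names made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
        "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"],
    }
)

# True for every row that has at least one identical copy elsewhere in df.
has_copy = df.duplicated(keep=False)

# Invert the mask to keep only the rows that occur exactly once.
result = df[~has_copy]
print(result)
```

Unlike the merge, this keeps the original index labels of the surviving rows.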
CodePudding user response:
Let's define your sample DataFrame:
import pandas as pd
data = {"ORDER ID":[123, 123, 123, 456, 456, 456, 789, 000], "ITEM CODE":['XXX', 'YYY', 'YYY', 'XXX', 'XXX', 'XXX', 'XXX', 'YYY']}
df = pd.DataFrame(data)
ORDER ID ITEM CODE
123 XXX
123 YYY
123 YYY
456 XXX
456 XXX
456 XXX
789 XXX
0 YYY
You can remove duplicates based on selected columns or on all columns. The subset parameter accepts a single column name or a list of column names, and by default the first row of each duplicate group is kept.
new_df = df.drop_duplicates(subset='ORDER ID')
ORDER ID ITEM CODE
123 XXX
456 XXX
789 XXX
0 YYY
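To illustrate subset with a list of column names, here is a small sketch on the same sample data. The name first_of_each is made up for illustration:

```python
import pandas as pd

data = {
    "ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
    "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"],
}
df = pd.DataFrame(data)

# Rows count as duplicates only when both columns match.
# With the default keep="first", the first row of each group is kept.
first_of_each = df.drop_duplicates(subset=["ORDER ID", "ITEM CODE"])
print(first_of_each)
```

With both columns in subset, 123/XXX and 123/YYY are treated as different rows, so both stay, along with one 456/XXX.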