I'm trying to remove all rows that have any duplicates; I ONLY want the unique rows. I've tried the keep=False parameter for drop_duplicates() with subset=['ORDER ID', 'ITEM CODE'], and it's just not doing the right thing.
Let's say my dataframe looks something like this:
| ORDER ID | ITEM CODE |
| 123      | XXX       |
| 123      | YYY       |
| 123      | YYY       |
| 456      | XXX       |
| 456      | XXX       |
| 456      | XXX       |
| 789      | XXX       |
| 000      | YYY       |
I want it to look like this:
| ORDER ID | ITEM CODE |
| 123      | XXX       |
| 789      | XXX       |
| 000      | YYY       |
As you can see, the subset would be both the ORDER ID and ITEM CODE columns, and ideally we would lose rows 2-6. (The actual dataset has a lot more columns.)
CodePudding user response:
Not sure what your issue is. Works fine.
import pandas as pd

data = [[123, 'XXX', 11],
        [123, 'YYY', 22],
        [123, 'YYY', 33],
        [456, 'XXX', 44],
        [456, 'XXX', 55],
        [456, 'XXX', 66],
        [789, 'XXX', 77],
        [000, 'YYY', 88]]
columns = ['ORDER ID', 'ITEM CODE', 'extra column']
df = pd.DataFrame(data, columns=columns)
df = df.drop_duplicates(subset=['ORDER ID', 'ITEM CODE'], keep=False)
Output:
Before
print(df)
ORDER ID ITEM CODE extra column
0 123 XXX 11
1 123 YYY 22
2 123 YYY 33
3 456 XXX 44
4 456 XXX 55
5 456 XXX 66
6 789 XXX 77
7 0 YYY 88
After
print(df)
ORDER ID ITEM CODE extra column
0 123 XXX 11
6 789 XXX 77
7 0 YYY 88
CodePudding user response:
Try modifying your subset to include ORDER ID only:

df.drop_duplicates(subset=['ORDER ID'])
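For reference, a quick check of what this produces on the sample data from the question (extra column omitted): with the default keep='first', the first row of each ORDER ID group is retained rather than all duplicated rows being dropped.

```python
import pandas as pd

# Sample data from the question, without the extra column
df = pd.DataFrame({'ORDER ID': [123, 123, 123, 456, 456, 456, 789, 0],
                   'ITEM CODE': ['XXX', 'YYY', 'YYY', 'XXX', 'XXX', 'XXX', 'XXX', 'YYY']})

# Default keep='first': one row survives per ORDER ID
out = df.drop_duplicates(subset=['ORDER ID'])
print(out)
```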
CodePudding user response:
It's very likely that you are simply not assigning the result back to your dataframe. You might be doing

df.drop_duplicates()

But drop_duplicates returns a new DataFrame by default, so this fails to overwrite your previous values. Rather, you should be doing
df = df.drop_duplicates()
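A minimal sketch of the difference, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'ORDER ID': [123, 123, 456],
                   'ITEM CODE': ['XXX', 'XXX', 'YYY']})

# Without assignment, the returned DataFrame is discarded
df.drop_duplicates(subset=['ORDER ID', 'ITEM CODE'], keep=False)
print(len(df))  # still 3

# Assigning the result back actually updates df
df = df.drop_duplicates(subset=['ORDER ID', 'ITEM CODE'], keep=False)
print(len(df))  # 1
```

Passing inplace=True would also work, but assigning the result back is the more common idiom.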
If you can't get drop_duplicates to work, you can use numpy.unique as a workaround. Apply it to the key columns together; calling it on each column separately would break the row pairing and generally return arrays of the wrong length:

import numpy as np

keys = df['ORDER ID'].astype(str) + '|' + df['ITEM CODE']
values, counts = np.unique(keys, return_counts=True)
df = df[keys.isin(values[counts == 1])]
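Alternatively, the same "keep only unique rows" filter can be expressed with pandas' own duplicated(), avoiding numpy entirely. A sketch using the column names from the question:

```python
import pandas as pd

df = pd.DataFrame({'ORDER ID': [123, 123, 456],
                   'ITEM CODE': ['XXX', 'XXX', 'YYY']})

# duplicated(keep=False) marks every row whose key appears more than once
mask = df.duplicated(subset=['ORDER ID', 'ITEM CODE'], keep=False)
df = df[~mask]
print(df)
```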