I have a DataFrame in Python like the one below, where some IDs appear on more than one row:
ID | COL1 | COL2 | COL3 |
---|---|---|---|
123 | XX | 111 | ENG |
123 | abc | 111 | ENG |
444 | ccc | 2 | o |
444 | ccc | 2 | o |
67 | a | 89 | xx |
And I need to select rows where there is a situation like for ID = 123: the rows share an ID but differ in one or more columns. As output I need something like this:
ID | COL1 | COL2 | COL3 |
---|---|---|---|
123 | XX | 111 | ENG |
123 | abc | 111 | ENG |
How can I do that in Python pandas? I should add that my real dataset has many more columns, so the solution needs to work for an arbitrary number of columns, not only ID, COL1, COL2, COL3 :)
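For reference, here is a minimal construction of the sample DataFrame (the dtypes are assumed):

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'ID':   [123, 123, 444, 444, 67],
    'COL1': ['XX', 'abc', 'ccc', 'ccc', 'a'],
    'COL2': [111, 111, 2, 2, 89],
    'COL3': ['ENG', 'ENG', 'o', 'o', 'xx'],
})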
CodePudding user response:
Here is one way to do it:
# drop the duplicates
df.drop_duplicates(inplace=True)
# groupby ID and filter the ones where group size is greater than 1
df[df.groupby('ID')['ID'].transform('size')>1]
ID COL1 COL2 COL3
0 123 XX 111 ENG
1 123 abc 111 ENG
Alternatively,
# preserve the original DF and create a secondary DF with non-duplicate rows
df2=df.drop_duplicates()
# using loc, select the rows in df2 whose group size exceeds 1
df2.loc[df2.groupby('ID')['ID'].transform('size')>1]
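The two steps can also be chained into a single expression that leaves df untouched (a sketch using a lambda inside .loc):

# dedup, then keep only IDs that still occur more than once
df.drop_duplicates().loc[lambda d: d.groupby('ID')['ID'].transform('size') > 1]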
CodePudding user response:
Using .query
df = df.query("ID == 123").drop_duplicates().reset_index(drop=True)
print(df)
ID COL1 COL2 COL3
0 123 XX 111 ENG
1 123 abc 111 ENG
Or, if you aren't also trying to filter to a single ID:
df = df.drop_duplicates().reset_index(drop=True)
print(df)
ID COL1 COL2 COL3
0 123 XX 111 ENG
1 123 abc 111 ENG
2 444 ccc 2 o
3 67 a 89 xx
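To avoid hardcoding ID 123, the conflicting IDs can be computed first and passed to .query with @ (a sketch; the variable names deduped and conflicting are just for illustration):

# drop exact duplicates, then collect IDs that still repeat
deduped = df.drop_duplicates()
conflicting = deduped.loc[deduped['ID'].duplicated(keep=False), 'ID'].unique()
df = deduped.query("ID in @conflicting").reset_index(drop=True)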
CodePudding user response:
First drop duplicates across all columns, then flag the rows whose ID is still duplicated, and finally select those rows:
df = df.drop_duplicates()
mask = df.duplicated(subset=['ID'], keep=False)
df = df[mask]
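Since the real dataset has many columns, the same three steps can be wrapped in a small helper that only needs the key column named (a sketch; the function name conflicting_rows and the key parameter are illustrative):

import pandas as pd

def conflicting_rows(df: pd.DataFrame, key: str = 'ID') -> pd.DataFrame:
    # remove rows that are exact duplicates across every column
    deduped = df.drop_duplicates()
    # keep rows whose key still appears more than once
    return deduped[deduped.duplicated(subset=[key], keep=False)]

result = conflicting_rows(df)  # returns the two ID 123 rows from the sample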