How to find rows that are duplicated by ID but have different values in some columns in a Python Pandas DataFrame?


I have a DataFrame in Python like the one below, where some IDs appear on more than one row:

ID   COL1  COL2  COL3
123  XX    111   ENG
123  abc   111   ENG
444  ccc   2     o
444  ccc   2     o
67   a     89    xx

I need to select the rows where the situation is like that of ID = 123: the rows share the same ID but differ in one or more columns. As output I need something like below:

ID   COL1  COL2  COL3
123  XX    111   ENG
123  abc   111   ENG

How can I do that in Python Pandas? I should add that my real dataset has many more columns, so I need a solution that works for any number of columns, not only ID, COL1, COL2, COL3 :)
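For reproducibility, the sample DataFrame can be rebuilt like this (a minimal sketch using the values shown above):

import pandas as pd

# rebuild the sample data shown in the question
df = pd.DataFrame({
    'ID':   [123, 123, 444, 444, 67],
    'COL1': ['XX', 'abc', 'ccc', 'ccc', 'a'],
    'COL2': [111, 111, 2, 2, 89],
    'COL3': ['ENG', 'ENG', 'o', 'o', 'xx'],
})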

CodePudding user response:

Here is one way to do it:

# drop the exact duplicates in place
df.drop_duplicates(inplace=True)

# group by ID and keep the rows where the group size is greater than 1
df[df.groupby('ID')['ID'].transform('size') > 1]

    ID COL1  COL2 COL3
0  123   XX   111  ENG
1  123  abc   111  ENG

Alternatively,

# preserve the original df: create a secondary df with exact duplicates removed
df2 = df.drop_duplicates()

# using loc, select the rows in df2 whose ID group size exceeds 1
df2.loc[df2.groupby('ID')['ID'].transform('size') > 1]
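For readers who prefer method chaining, the same logic can be written as a single expression (a sketch of the identical approach, using a lambda inside .loc):

# drop exact duplicates, then keep IDs that still occur more than once
out = (
    df.drop_duplicates()
      .loc[lambda d: d.groupby('ID')['ID'].transform('size') > 1]
)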

CodePudding user response:

Using .query

df = df.query("ID.eq(123)").drop_duplicates().reset_index(drop=True)
print(df)

    ID COL1  COL2 COL3
0  123   XX   111  ENG
1  123  abc   111  ENG

Or, if you aren't also trying to filter to a specific ID:

df = df.drop_duplicates().reset_index(drop=True)
print(df)

    ID COL1  COL2 COL3
0  123   XX   111  ENG
1  123  abc   111  ENG
2  444  ccc     2    o
3   67    a    89   xx
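Note that the first snippet hard-codes ID 123. To keep every ID that still occurs more than once without naming it, the .query style can be combined with duplicated — a sketch, assuming engine='python' so the method call is allowed inside the expression:

# after removing exact duplicates, keep rows whose ID still repeats
df = (
    df.drop_duplicates()
      .query("ID.duplicated(keep=False)", engine='python')
      .reset_index(drop=True)
)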

CodePudding user response:

First drop duplicates across all columns, then mark the rows whose ID is still duplicated, and finally select those rows:

# remove rows that are exact duplicates across all columns
df = df.drop_duplicates()

# keep=False flags every occurrence of a repeated ID, not just the later ones
mask = df.duplicated(subset=['ID'], keep=False)
df = df[mask]
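On the sample data this yields the same result as the first answer:

print(df)

    ID COL1  COL2 COL3
0  123   XX   111  ENG
1  123  abc   111  ENG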