Home > Software engineering >  Pandas dataframe, if a value in a column has a duplicate, check if a specific value in another colum
Pandas dataframe, if a value in a column has a duplicate, check if a specific value in another colum

Time:07-15

For example, in this dataframe I have 2 rows with duplicate values(ABCD, TG) in 2 columns(sequence and ID) and a different value in the 3rd column(group).

How would I check the whole dataframe for any instances where there are duplicate values in these 2 columns(sequence and ID) and if the group value is also different then change group value from 0 to 1.

SEQUENCE ID Group
ABCD TG 0
BACD LK 0
ABCD TG 1

I would like an output like this:

SEQUENCE ID Group
ABCD TG 1
BACD LK 0
ABCD TG 1

Thank you for any solutions and suggestion!

CodePudding user response:

Probably it's not the most concise way for doing this but it seems to provide the desired result:

import pandas as pd

d = {
    'SEQUENCE': ['ABCD', 'BACD', 'ABCD'],
    'ID': ['TG', 'LK', 'TG'],
    'Group': [0, 0, 1]
}

df = pd.DataFrame(d)


def check_duplicates(df, seq, elem_id, group):
    
    different_groups = True if df[(df.SEQUENCE == seq) & (df.ID == elem_id)]['Group'].nunique() > 1 else False 
    sequence_dupls = True if len(df[df.SEQUENCE == seq]) > 1 else False
    id_dupls = True if len(df[df.ID == elem_id]) > 1 else False
    
    if sequence_dupls and id_dupls and different_groups:
        return 1 
    else:
        return group


df['Group'] = df.apply(lambda row: check_duplicates(df, row.SEQUENCE, row.ID, row.Group), axis=1)

print(df)

CodePudding user response:

Assuming you have only 0, 1 in your Group column, you could do groupby/transform with numpy's any The any function checks if any value is other than 0 if yes returns True:

df['Group'] = df.groupby(['SEQUENCE', 'ID']).transform(lambda x: 1 if np.any(x) else 0)

print(df):

  SEQUENCE  ID  Group
0     ABCD  TG      1
1     BACD  LK      0
2     ABCD  TG      1
  • Related