For example, in this dataframe I have 2 rows with duplicate values(ABCD, TG) in 2 columns(sequence and ID) and a different value in the 3rd column(group).
How would I check the whole dataframe for any instances where there are duplicate values in these 2 columns(sequence and ID) and if the group value is also different then change group value from 0 to 1.
SEQUENCE | ID | Group |
---|---|---|
ABCD | TG | 0 |
BACD | LK | 0 |
ABCD | TG | 1 |
I would like an output like this:
SEQUENCE | ID | Group |
---|---|---|
ABCD | TG | 1 |
BACD | LK | 0 |
ABCD | TG | 1 |
Thank you for any solutions and suggestion!
CodePudding user response:
Probably it's not the most concise way for doing this but it seems to provide the desired result:
import pandas as pd
d = {
'SEQUENCE': ['ABCD', 'BACD', 'ABCD'],
'ID': ['TG', 'LK', 'TG'],
'Group': [0, 0, 1]
}
df = pd.DataFrame(d)
def check_duplicates(df, seq, elem_id, group):
different_groups = True if df[(df.SEQUENCE == seq) & (df.ID == elem_id)]['Group'].nunique() > 1 else False
sequence_dupls = True if len(df[df.SEQUENCE == seq]) > 1 else False
id_dupls = True if len(df[df.ID == elem_id]) > 1 else False
if sequence_dupls and id_dupls and different_groups:
return 1
else:
return group
df['Group'] = df.apply(lambda row: check_duplicates(df, row.SEQUENCE, row.ID, row.Group), axis=1)
print(df)
CodePudding user response:
Assuming you have only 0, 1 in your Group
column, you could do groupby/transform with numpy's any
The any
function checks if any value is other than 0 if yes returns True:
df['Group'] = df.groupby(['SEQUENCE', 'ID']).transform(lambda x: 1 if np.any(x) else 0)
print(df):
SEQUENCE ID Group
0 ABCD TG 1
1 BACD LK 0
2 ABCD TG 1