I have a pandas df with four columns (Target1, Target2, Target3, All_targets). As this df was created directly by a tool, I would like to identify whether there are rows that contain the same combination of targets but in a different order. More specifically, I would like to identify only those rows that all three targets are the same.
Here is a reproducible example of my dataframe.
data = {'Target1':['IL17', 'TLR4', 'ERK', 'IL36','MEK'],
'Target2':['CD80', 'ERK', 'IL17', 'STAT1','TLR4'],
'Target13':['ERK', 'MEK', 'CD80', 'IL18','STAT3'],
'All_targets':['IL17_CD80_ERK', 'TLR4_ERK_MEK', 'ERK_IL17_CD80', 'IL36_STAT1_IL18','MEK_TLR4_STAT3']}
df = pd.DataFrame(data)
df
The expected output would be a dataframe without those rows that contain the same three targets.
filtered_df = {'Target1':['IL17', 'TLR4', 'IL36','MEK'],
'Target2':['CD80', 'ERK', 'STAT1','TLR4'],
'Target13':['ERK', 'MEK', 'IL18','STAT3'],
'All_targets':['IL17_CD80_ERK', 'TLR4_ERK_MEK', 'IL36_STAT1_IL18','MEK_TLR4_STAT3']}
Thank you in advance!
CodePudding user response:
Create frozensets for first 3 columns selected by position and then duplicates by Series.duplicated
, last filter by boolean indexing
:
mask = df.iloc[:, :3].apply(frozenset, 1).duplicated(keep=False)
df1 = df[mask]
print (df1)
Target1 Target2 Target13 All_targets
0 IL17 CD80 ERK IL17_CD80_ERK
2 ERK IL17 CD80 ERK_IL17_CD80
df2 = df[~mask]
CodePudding user response:
:) You can also try this:
df['sorted'] = df['All_targets'].apply(lambda x: ','.join(sorted(x.split('_'))))
df = df.drop_duplicates(subset = ['sorted'])
df.drop(columns = ['sorted'], inplace = True)
df