Home > other >  Identify rows that contain the same values but in different order in pandas
Identify rows that contain the same values but in different order in pandas

Time:08-09

I have a pandas df with four columns (Target1, Target2, Target3, All_targets). As this df was created directly by a tool, I would like to identify whether there are rows that contain the same combination of targets but in a different order. More specifically, I would like to identify only those rows that all three targets are the same.

Here is a reproducible example of my dataframe.

data = {'Target1':['IL17', 'TLR4', 'ERK', 'IL36','MEK'],
        'Target2':['CD80', 'ERK', 'IL17', 'STAT1','TLR4'],
        'Target13':['ERK', 'MEK', 'CD80', 'IL18','STAT3'],
        'All_targets':['IL17_CD80_ERK', 'TLR4_ERK_MEK', 'ERK_IL17_CD80', 'IL36_STAT1_IL18','MEK_TLR4_STAT3']}
df = pd.DataFrame(data)
df

The expected output would be a dataframe without those rows that contain the same three targets.

filtered_df = {'Target1':['IL17', 'TLR4', 'IL36','MEK'],
            'Target2':['CD80', 'ERK', 'STAT1','TLR4'],
            'Target13':['ERK', 'MEK', 'IL18','STAT3'],
            'All_targets':['IL17_CD80_ERK', 'TLR4_ERK_MEK', 'IL36_STAT1_IL18','MEK_TLR4_STAT3']}

Thank you in advance!

CodePudding user response:

Create frozensets for first 3 columns selected by position and then duplicates by Series.duplicated, last filter by boolean indexing:

mask = df.iloc[:, :3].apply(frozenset, 1).duplicated(keep=False)
df1 = df[mask]
print (df1)

  Target1 Target2 Target13    All_targets
0    IL17    CD80      ERK  IL17_CD80_ERK
2     ERK    IL17     CD80  ERK_IL17_CD80

df2 = df[~mask]

CodePudding user response:

:) You can also try this:

df['sorted'] = df['All_targets'].apply(lambda x: ','.join(sorted(x.split('_'))))
df = df.drop_duplicates(subset = ['sorted'])
df.drop(columns = ['sorted'], inplace = True)
df
  • Related