Home > Software design >  Pandas Filter rows by comparing columns A1 with A2
Pandas Filter rows by comparing columns A1 with A2

Time:09-20

CHR SNP BP A1 A2 OR P
8 rs62513865 101592213 T C 1.00652 0.8086
8 rs79643588 106973048 A T 1.01786 0.4606

I have this table example, and I want to filter rows by comparing column A1 with A2.
If this four conditions happen, delete the line

A1 A2
A T
T A
C G
G C

(e.g. line 2 in the first table).
How can i do that using python Pandas ?

CodePudding user response:

here is one way to do it
Combine the two columns for each of the two DF. Make it a list in case of the second DF and search the first combination in the second one


df[~(df['A1'] df['A2']).str.strip()
   .isin(df2['A1'] df2['A2'].tolist())]

    CHR     SNP     BP  A1  A2  OR  P
0   8   rs62513865  101592213   T   C   1.00652     0.8086

CodePudding user response:

keeping

Assuming df1 and df2, you can simply merge to keep the common values:

out = df1.merge(df2)

output:

   CHR         SNP         BP A1 A2       OR       P
0    8  rs79643588  106973048  A  T  1.01786  0.4606

dropping

For removing the rows, perform a negative merge:

out = (df1.merge(df2, how='outer', indicator=True)
          .loc[lambda d: d.pop('_merge').eq('left_only')]
       )

Or merge and get the remaining indices to drop (requires unique indices):

out = df1.drop(df1.reset_index().merge(df2)['index'])

output:

   CHR         SNP           BP A1 A2       OR       P
0  8.0  rs62513865  101592213.0  T  C  1.00652  0.8086

alternative approach

As it seems you have nucleotides and want to drop the cases that do not match a A/T or G/C pair, you could translate A to T and C to G in A1 and check that the value is not identical to that of A2:

m = df1['A1'].map({'A': 'T', 'C': 'G'}).fillna(df1['A1']).ne(df1['A2'])

out = df1[m]
  • Related