CHR | SNP | BP | A1 | A2 | OR | P |
---|---|---|---|---|---|---|
8 | rs62513865 | 101592213 | T | C | 1.00652 | 0.8086 |
8 | rs79643588 | 106973048 | A | T | 1.01786 | 0.4606 |
I have this table example, and I want to filter rows by comparing column A1 with A2.
If this four conditions happen, delete the line
A1 | A2 |
---|---|
A | T |
T | A |
C | G |
G | C |
(e.g. line 2 in the first table).
How can i do that using python Pandas ?
CodePudding user response:
here is one way to do it
Combine the two columns for each of the two DF. Make it a list in case of the second DF and search the first combination in the second one
df[~(df['A1'] df['A2']).str.strip()
.isin(df2['A1'] df2['A2'].tolist())]
CHR SNP BP A1 A2 OR P
0 8 rs62513865 101592213 T C 1.00652 0.8086
CodePudding user response:
keeping
Assuming df1
and df2
, you can simply merge
to keep the common values:
out = df1.merge(df2)
output:
CHR SNP BP A1 A2 OR P
0 8 rs79643588 106973048 A T 1.01786 0.4606
dropping
For removing the rows, perform a negative merge:
out = (df1.merge(df2, how='outer', indicator=True)
.loc[lambda d: d.pop('_merge').eq('left_only')]
)
Or merge
and get the remaining indices to drop
(requires unique indices):
out = df1.drop(df1.reset_index().merge(df2)['index'])
output:
CHR SNP BP A1 A2 OR P
0 8.0 rs62513865 101592213.0 T C 1.00652 0.8086
alternative approach
As it seems you have nucleotides and want to drop the cases that do not match a A/T or G/C pair, you could translate A to T and C to G in A1 and check that the value is not identical to that of A2:
m = df1['A1'].map({'A': 'T', 'C': 'G'}).fillna(df1['A1']).ne(df1['A2'])
out = df1[m]