I have a dataframe as below
CHROM POS FORMAT HG02317 HG03195 HG01959 HG03197
chr1 241663902 GT 0/0 0/0 0/0 0/1
chr1 241663902 GT 0/0 0/0 0/0 0/0
chr2 47641559 GT 0/1 0/1 0/0 0/0
chrX 47641559 GT 1 0 1 0
I tried to use
filtered= df[df['HG02317'].str.contains("0/1|1")]
but this removes all other rows which still contains 0/1|1. I want to check all columns from iloc[3:] for 0/1 or 1 and then delete it if not present. Also, as column names from [3:] keeps varying i cant use the below as well though it works
filtered= df.query('HG02317== "0/1" | HG01959 == "0/1"')
Which means i want to search by col index
Using index ,
filtered= df[(df.loc[:, 3:] == '0/1')] gives error.
My expected Output is:
CHROM POS FORMAT HG02317 HG03195 HG01959 HG03197
chr1 241663902 GT 0/0 0/0 0/0 0/1
chr2 47641559 GT 0/1 0/1 0/0 0/0
chrX 47641559 GT 1 0 1 0
Is there any work around for this in pandas? Thanks in advance
CodePudding user response:
First select all columns with omit first 3 use DataFrame.iloc
, then compare by DataFrame.isin
and test by DataFrame.any
all rows if at least one value match:
#added 1 like string or 1 like number
filtered = df[df.iloc[:, 3:].isin(['0/1', '1', 1]).any(axis=1)]
print (filtered)
CHROM POS FORMAT HG02317 HG03195 HG01959 HG03197
0 chr1 241663902 GT 0/0 0/0 0/0 0/1
2 chr2 47641559 GT 0/1 0/1 0/0 0/0
3 chrX 47641559 GT 1 0 1 0