Home > database >  Extract lines which contain non-zero value for keys in certain column
Extract lines which contain non-zero value for keys in certain column

Time:12-09

I have very large tab delimited file where 8th column contains mutiple key:values seprated by ; How to extract entire lines based on certain key:values

Conditions to match non-zero key-value pairs for the following

1.

df[(df['AN_NEA'] >0) | (df['AC_NEA'] >0) | (df['HOM_NEA'] >0) | (df['AF_NEA'] >0)]

and similiarly for the following keys

AN_SEA
AC_SEA
HOM_SEA
AF_SEA

Tsv file

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO  
1   10523   .   TCCG    T   513.98  PASS    AC=3;AF=0.0008892;AN=3388;BaseQRankSum=0.926;MQ=45.64;MQRankSum=0.198;QD=5.78;ReadPosRankSum=-0.139;SOR=1.132;VQSLOD=0.453;culprit=SOR;AN_OCE=148;AC_OCE=0;HOM_OCE=0;AF_OCE=0;AN_NEA=654;AC_NEA=0;HOM_NEA=0;AF_NEA=0;AN_WER=228;AC_WER=0;HOM_WER=0;AF_WER=0;AN_AMR=52;AC_AMR=0;HOM_AMR=0;AF_AMR=0;AN_AFR=204;AC_AFR=3;HOM_AFR=0;AF_AFR=0.014705;AN_SAS=1442;AC_SAS=0;HOM_SAS=0;AF_SAS=0;AN_SEA=660;AC_SEA=0;HOM_SEA=0;AF_SEA=0  
1   14397   rs370886505 CTGT    C   11709.3 PASS    AC=160;AF=0.047;AN=3450;BaseQRankSum=0.289;;MQ=30.55;MQRankSum=0;QD=4.02;ReadPosRankSum=1.02;SOR=0.7;VQSLOD=0.421;culprit=QD;AN_OCE=148;AC_OCE=17;HOM_OCE=0;AF_OCE=0.114864;AN_NEA=702;AC_NEA=12;HOM_NEA=0;AF_NEA=0.017094;AN_WER=226;AC_WER=24;HOM_WER=0;AF_WER=0.106194;AN_AMR=52;AC_AMR=7;HOM_AMR=0;AF_AMR=0.134615;AN_AFR=208;AC_AFR=7;HOM_AFR=0;AF_AFR=0.033653;AN_SAS=1426;AC_SAS=65;HOM_SAS=1;AF_SAS=0.045582;AN_SEA=688;AC_SEA=28;HOM_SEA=0;AF_SEA=0.040697  
1   28591   .   T   TGG 4146.14 PASS    AC=67;AF=0.036;AN=1848;BaseQRankSum=0.736;MQ=9.35;MQRankSum=0.736;NEGATIVE_TRAIN_SITE;QD=25.75;ReadPosRankSum=0.804;SOR=1.911;VQSLOD=1.59;culprit=DP;AN_OCE=92;AC_OCE=4;HOM_OCE=1;AF_OCE=0.043478;AN_NEA=426;AC_NEA=39;HOM_NEA=10;AF_NEA=0.091549;AN_WER=74;AC_WER=3;HOM_WER=1;AF_WER=0.04054;AN_AMR=8;AC_AMR=1;HOM_AMR=0;AF_AMR=0.125;AN_AFR=130;AC_AFR=7;HOM_AFR=3;AF_AFR=0.053846;AN_SAS=746;AC_SAS=6;HOM_SAS=2;AF_SAS=0.008042;AN_SEA=372;AC_SEA=7;HOM_SEA=2;AF_SEA=0.018817  
1   29300   .   CG  C   111.6   PASS AC=1;AF=0.0003073;AN=3266;BaseQRankSum=0.163;MQ=25.25;MQRankSum=0.38;NEGATIVE_TRAIN_SITE;QD=6.97;ReadPosRankSum=-0.38;SOR=0.883;VQSLOD=-0.2157;culprit=FS;AN_OCE=142;AC_OCE=0;HOM_OCE=0;AF_OCE=0;AN_NEA=694;AC_NEA=0;HOM_NEA=0;AF_NEA=0;AN_WER=220;AC_WER=0;HOM_WER=0;AF_WER=0;AN_AMR=46;AC_AMR=0;HOM_AMR=0;AF_AMR=0;AN_AFR=176;AC_AFR=1;HOM_AFR=0;AF_AFR=0.005681;AN_SAS=1392;AC_SAS=0;HOM_SAS=0;AF_SAS=0;AN_SEA=596;AC_SEA=0;HOM_SEA=0;AF_SEA=0  
1   30955   .   CCT C   764.56  PASS    AC=33;AF=0.015;AN=2234;BaseQRankSum=0.727;MQ=24.19;MQRankSum=1.03;NEGATIVE_TRAIN_SITE;QD=18.2;ReadPosRankSum=0.736;SOR=2.658;VQSLOD=0.74;culprit=DP;AN_OCE=110;AC_OCE=0;HOM_OCE=0;AF_OCE=0;AN_NEA=556;AC_NEA=3;HOM_NEA=1;AF_NEA=0.005395;AN_WER=130;AC_WER=6;HOM_WER=1;AF_WER=0.046153;AN_AMR=30;AC_AMR=0;HOM_AMR=0;AF_AMR=0;AN_AFR=168;AC_AFR=2;HOM_AFR=1;AF_AFR=0.011904;AN_SAS=726;AC_SAS=17;HOM_SAS=7;AF_SAS=0.023415;AN_SEA=514;AC_SEA=5;HOM_SEA=0;AF_SEA=0.009727  

Any help would be much appreciated

Thank you for your time

CodePudding user response:

If the condition is only != 0 (and assuming all values are positive), you can use a regex with str.contains. Finding a single non zero digit is sufficient:

pattern = '(?:AN_SEA|AC_SEA|HOM_SEA|AF_SEA)=(?:0\.0*)?[1-9]'

out = df[df['INFO'].str.contains(pattern)]

regex demo

If you want to be more flexible (e.g. to use a different threshold) you can extract the number:

threshold = 0
pattern = '(?:AN_SEA|AC_SEA|HOM_SEA|AF_SEA)=(\d\.?\d*)'
out = df[df['INFO'].str.extractall(pattern)[0]
         .astype(float).gt(threshold)
         .groupby(level=0).any()]

regex demo

  • Related