How to extract some values in a column and calculate them, then generate a new column in linux-CodePudding

Here is my vcf file, now I want to extract some values form the last column V350092589_L01_84 to creat new columns.

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  V350092589_L01_84
    chr19   11224265        .       A       G       23868.64        PASS    AC=1;AF=0.500;AN=2;DP=3417;ExcessHet=0.0000;FS=8.538;MLEAC=1;MLEAF=0.500;MQ=41.37;MQRankSum=1.59;QD=7.57;ReadPosRankSum=9.38;SOR=0.783          GT:AD:DP:GQ:PL  0/1:2029,1125:3154:99:23876,0,49821
    chr19   11227576        .       C       T       8055.64         PASS    AC=1;AF=0.500;AN=2;DP=1025;ExcessHet=0.0000;FS=3.316;MLEAC=1;MLEAF=0.500;MQ=41.34;MQRankSum=-4.736e 00;QD=9.12;ReadPosRankSum=2.55;SOR=0.982    GT:AD:DP:GQ:PL  0/1:533,350:883:99:8063,0,15924

Specifically.

Extracting the sub-column DP=xxxx which is in the last column and generating a new column.
Extract the sub-column AD: 2092, 1125; AD: 553, 350 and calculate a ratio value, e.g. Ratio=1125/(1125 2092), Ratio=350/(350 553). Then take the Ratio value and generate a new column

Then,combine the two columns to a new csv file, like this

DP     Ratio
3154   0.34
883    0.38

CodePudding user response：

You can do it easily in awk, but your numbers seem to be off by 0.02 (see below for reason). For instance, you can do:

awk '
  FNR==1 {                    # if first record
    printf "DP\tRatio\n"      # output heading
    next                      # skip to next record
  } 
  {
    nf = $NF                  # save copy of last field
    gsub (/[,:]/, " ", nf)    # replace "," and ":" with " "
    split (nf, arr, " ")      # split into arr on space
    printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2] arr[3])  # output result
  }
' file

Where for example with record 2, arr[3] == 1125, and arr[2] == 2029. The result of the computation is 0.36 not 0.34? (you used 2092 in your calculation instead of the actual 2029 and 553 instead of the actual 533 -- mystery solved)

Example Use/Output

You can just paste the script into an xterm with your data file (named file or whatever you change the filename to) as:

$ awk '
>   FNR==1 {                    # if first record
>     printf "DP\tRatio\n"      # output heading
>     next                      # skip to next record
>   }
>   {
>     nf = $NF                  # save copy of last field
>     gsub (/[,:]/, " ", nf)    # replace "," and ":" with " "
>     split (nf, arr, " ")      # split into arr on space
>     printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2] arr[3])  # output result
>   }
> ' file
DP      Ratio
3154    0.36
883     0.40

So the output based on the file with the fields split as shown above is:

DP      Ratio
3154    0.36
883     0.40

Creating an Awk Script

Creating a script with awk to run from the command line is a convenient way to make the awk commands reusable. For example you can create a file say, splitrec.awk containing:

#!/bin/awk

FNR == 1 {                  # if first record
  printf "DP\tRatio\n"      # output heading
  next                      # skip to next record
}
{
  nf = $NF                  # save copy of last field
  gsub (/[,:]/, " ", nf)    # replace "," and ":" with " "
  split (nf, arr, " ")      # split into arr on space
  printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2] arr[3])  # output result
}

Then your use of the script becomes:

$  awk -f splitrec.awk file
DP      Ratio
3154    0.36
883     0.40

CodePudding user response：

Here's a way to extract out the fields, the rest is trivial :

{m,g} '!_<NR ? sub(OFS, _, $!(NF=NF)) : $_="DP\tRatio" ' \
                   OFS='\t'                               \
       FS='^.*;DP=|[;][^ \t] [ \t] [^ \t] [ \t] [^:] :|[0-9] :[0-9] ,.*,.*$|[:,]'

DP      Ratio
3417    2029  1125  3154        
1025    533   350   883

A less intuitive way of doing it would be (adding NF check to protect blank lines)

{m,g}awk ' _~NF ? !_ : !_<NR ? sub(".",_,$!--NF)--NF : $_="DP\tRatio"'