Home > Software design >  How to extract some values in a column and calculate them, then generate a new column in linux
How to extract some values in a column and calculate them, then generate a new column in linux

Time:06-10

Here is my vcf file, now I want to extract some values form the last column V350092589_L01_84 to creat new columns.

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  V350092589_L01_84
    chr19   11224265        .       A       G       23868.64        PASS    AC=1;AF=0.500;AN=2;DP=3417;ExcessHet=0.0000;FS=8.538;MLEAC=1;MLEAF=0.500;MQ=41.37;MQRankSum=1.59;QD=7.57;ReadPosRankSum=9.38;SOR=0.783          GT:AD:DP:GQ:PL  0/1:2029,1125:3154:99:23876,0,49821
    chr19   11227576        .       C       T       8055.64         PASS    AC=1;AF=0.500;AN=2;DP=1025;ExcessHet=0.0000;FS=3.316;MLEAC=1;MLEAF=0.500;MQ=41.34;MQRankSum=-4.736e 00;QD=9.12;ReadPosRankSum=2.55;SOR=0.982    GT:AD:DP:GQ:PL  0/1:533,350:883:99:8063,0,15924

Specifically.

  1. Extracting the sub-column DP=xxxx which is in the last column and generating a new column.
  2. Extract the sub-column AD: 2092, 1125; AD: 553, 350 and calculate a ratio value, e.g. Ratio=1125/(1125 2092), Ratio=350/(350 553). Then take the Ratio value and generate a new column

Then,combine the two columns to a new csv file, like this

DP     Ratio
3154   0.34
883    0.38

CodePudding user response:

You can do it easily in awk, but your numbers seem to be off by 0.02 (see below for reason). For instance, you can do:

awk '
  FNR==1 {                    # if first record
    printf "DP\tRatio\n"      # output heading
    next                      # skip to next record
  } 
  {
    nf = $NF                  # save copy of last field
    gsub (/[,:]/, " ", nf)    # replace "," and ":" with " "
    split (nf, arr, " ")      # split into arr on space
    printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2] arr[3])  # output result
  }
' file

Where for example with record 2, arr[3] == 1125, and arr[2] == 2029. The result of the computation is 0.36 not 0.34? (you used 2092 in your calculation instead of the actual 2029 and 553 instead of the actual 533 -- mystery solved)

Example Use/Output

You can just paste the script into an xterm with your data file (named file or whatever you change the filename to) as:

$ awk '
>   FNR==1 {                    # if first record
>     printf "DP\tRatio\n"      # output heading
>     next                      # skip to next record
>   }
>   {
>     nf = $NF                  # save copy of last field
>     gsub (/[,:]/, " ", nf)    # replace "," and ":" with " "
>     split (nf, arr, " ")      # split into arr on space
>     printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2] arr[3])  # output result
>   }
> ' file
DP      Ratio
3154    0.36
883     0.40

So the output based on the file with the fields split as shown above is:

DP      Ratio
3154    0.36
883     0.40

Creating an Awk Script

Creating a script with awk to run from the command line is a convenient way to make the awk commands reusable. For example you can create a file say, splitrec.awk containing:

#!/bin/awk

FNR == 1 {                  # if first record
  printf "DP\tRatio\n"      # output heading
  next                      # skip to next record
}
{
  nf = $NF                  # save copy of last field
  gsub (/[,:]/, " ", nf)    # replace "," and ":" with " "
  split (nf, arr, " ")      # split into arr on space
  printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2] arr[3])  # output result
}

Then your use of the script becomes:

$  awk -f splitrec.awk file
DP      Ratio
3154    0.36
883     0.40

CodePudding user response:

Here's a way to extract out the fields, the rest is trivial :

{m,g} '!_<NR ? sub(OFS, _, $!(NF=NF)) : $_="DP\tRatio" ' \
                   OFS='\t'                               \
       FS='^.*;DP=|[;][^ \t] [ \t] [^ \t] [ \t] [^:] :|[0-9] :[0-9] ,.*,.*$|[:,]'
       

DP      Ratio
3417    2029  1125  3154        
1025    533   350   883 

A less intuitive way of doing it would be (adding NF check to protect blank lines)

{m,g}awk ' _~NF ? !_ : !_<NR ? sub(".",_,$!--NF)--NF : $_="DP\tRatio"' 
  • Related