Here is my vcf file, now I want to extract some values form the last column V350092589_L01_84 to creat new columns.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT V350092589_L01_84
chr19 11224265 . A G 23868.64 PASS AC=1;AF=0.500;AN=2;DP=3417;ExcessHet=0.0000;FS=8.538;MLEAC=1;MLEAF=0.500;MQ=41.37;MQRankSum=1.59;QD=7.57;ReadPosRankSum=9.38;SOR=0.783 GT:AD:DP:GQ:PL 0/1:2029,1125:3154:99:23876,0,49821
chr19 11227576 . C T 8055.64 PASS AC=1;AF=0.500;AN=2;DP=1025;ExcessHet=0.0000;FS=3.316;MLEAC=1;MLEAF=0.500;MQ=41.34;MQRankSum=-4.736e 00;QD=9.12;ReadPosRankSum=2.55;SOR=0.982 GT:AD:DP:GQ:PL 0/1:533,350:883:99:8063,0,15924
Specifically.
- Extracting the sub-column DP=xxxx which is in the last column and generating a new column.
- Extract the sub-column AD: 2092, 1125; AD: 553, 350 and calculate a ratio value, e.g. Ratio=1125/(1125 2092), Ratio=350/(350 553). Then take the Ratio value and generate a new column
Then,combine the two columns to a new csv file, like this
DP Ratio
3154 0.34
883 0.38
CodePudding user response:
You can do it easily in awk
, but your numbers seem to be off by 0.02
(see below for reason). For instance, you can do:
awk '
FNR==1 { # if first record
printf "DP\tRatio\n" # output heading
next # skip to next record
}
{
nf = $NF # save copy of last field
gsub (/[,:]/, " ", nf) # replace "," and ":" with " "
split (nf, arr, " ") # split into arr on space
printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2] arr[3]) # output result
}
' file
Where for example with record 2, arr[3] == 1125
, and arr[2] == 2029
. The result of the computation is 0.36
not 0.34
? (you used 2092
in your calculation instead of the actual 2029
and 553
instead of the actual 533
-- mystery solved)
Example Use/Output
You can just paste the script into an xterm with your data file (named file
or whatever you change the filename to) as:
$ awk '
> FNR==1 { # if first record
> printf "DP\tRatio\n" # output heading
> next # skip to next record
> }
> {
> nf = $NF # save copy of last field
> gsub (/[,:]/, " ", nf) # replace "," and ":" with " "
> split (nf, arr, " ") # split into arr on space
> printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2] arr[3]) # output result
> }
> ' file
DP Ratio
3154 0.36
883 0.40
So the output based on the file with the fields split as shown above is:
DP Ratio
3154 0.36
883 0.40
Creating an Awk Script
Creating a script with awk
to run from the command line is a convenient way to make the awk
commands reusable. For example you can create a file say, splitrec.awk
containing:
#!/bin/awk
FNR == 1 { # if first record
printf "DP\tRatio\n" # output heading
next # skip to next record
}
{
nf = $NF # save copy of last field
gsub (/[,:]/, " ", nf) # replace "," and ":" with " "
split (nf, arr, " ") # split into arr on space
printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2] arr[3]) # output result
}
Then your use of the script becomes:
$ awk -f splitrec.awk file
DP Ratio
3154 0.36
883 0.40
CodePudding user response:
Here's a way to extract out the fields, the rest is trivial :
{m,g} '!_<NR ? sub(OFS, _, $!(NF=NF)) : $_="DP\tRatio" ' \
OFS='\t' \
FS='^.*;DP=|[;][^ \t] [ \t] [^ \t] [ \t] [^:] :|[0-9] :[0-9] ,.*,.*$|[:,]'
DP Ratio
3417 2029 1125 3154
1025 533 350 883
A less intuitive way of doing it would be (adding NF
check to protect blank lines)
{m,g}awk ' _~NF ? !_ : !_<NR ? sub(".",_,$!--NF)--NF : $_="DP\tRatio"'