How to print fields containing specific substrings with awk?-CodePudding

Goal: To print fields from an input file to an output file, but with special regard to specific fields split by semicolons (; — see field nine of example input below).

Example Input (input.txt):

NC_051336.1     Gnomon  gene    40042   56215   .       -       .       gene_id "LOC102552844"; transcript_id ""; db_xref "GeneID:102552844"; db_xref "RGD:7551986"; gbkey "Gene"; gene "LOC102552844"; gene_biotype "pseudogene"; pseudo "true";
NC_051336.1     BestRefSeq,Gnomon     gene    76909   85762   .               .       gene_id "Vom2r3"; transcript_id ""; db_xref "GeneID:502213"; db_xref "RGD:1565892"; description "vomeronasal 2 receptor, 3"; gbkey "Gene"; gene "Vom2r3"; gene_biotype "protein_coding"; gene_synonym "RGD1565892";
NC_051336.1     Gnomon  gene    162525  192445  .       -       .       gene_id "LOC103693496"; transcript_id ""; db_xref "GeneID:103693496"; db_xref "RGD:9090555"; gbkey "Gene"; gene "LOC103693496"; gene_biotype "protein_coding";
NC_051336.1     Gnomon  transcript      162525  167758  .       -       .       gene_id "LOC103693496"; transcript_id "XM_039098304.1"; db_xref "GeneID:103693496"; gbkey "mRNA"; gene "LOC103693496"; model_evidence "Supporting evidence includes similarity to: 4 Proteins, and 90% coverage of the annotated genomic feature by RNAseq alignments"; product "sperm motility kinase X-like, transcript variant X9"; transcript_biotype "mRNA";
NC_051336.1     BestRefSeq      gene    226465  241532  .       -       .       gene_id "Vom2r4"; transcript_id ""; db_xref "GeneID:308248"; db_xref "RGD:1564110"; description "vomeronasal 2 receptor, 4"; gbkey "Gene"; gene "Vom2r4"; gene_biotype "protein_coding"; gene_synonym "RGD1564110";
NC_051336.1     BestRefSeq      transcript      226465  241532  .       -       .       gene_id "Vom2r4"; transcript_id "NM_001099458.1"; db_xref "GeneID:308248"; gbkey "mRNA"; gene "Vom2r4"; product "vomeronasal 2 receptor, 4"; transcript_biotype "mRNA";
NC_051336.1     Curated Genomic gene    267100  275769  .       -       .       gene_id "Vom2r-ps8"; transcript_id ""; db_xref "GeneID:502214"; db_xref "RGD:1563053"; description "vomeronasal 2 receptor, pseudogene 8"; gbkey "Gene"; gene "Vom2r-ps8"; gene_biotype "pseudogene"; gene_synonym "RGD1563053"; pseudo "true";
NC_051336.1     Gnomon  gene    284195  301267  .       -       .       gene_id "LOC102556157"; transcript_id ""; db_xref "GeneID:102556157"; db_xref "RGD:7626961"; gbkey "Gene"; gene "LOC102556157"; gene_biotype "protein_coding";
NC_051336.1     Gnomon  gene    307758  313908  .       -       .       gene_id "LOC108352169"; transcript_id ""; db_xref "GeneID:108352169"; db_xref "RGD:11507047"; gbkey "Gene"; gene "LOC108352169"; gene_biotype "lncRNA";
NC_051336.1     Gnomon  transcript      307758  313908  .       -       .       gene_id "LOC108352169"; transcript_id "XR_005497081.1"; db_xref "GeneID:108352169"; gbkey "ncRNA"; gene "LOC108352169"; model_evidence "Supporting evidence includes similarity to: 1 EST, and 98% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "uncharacterized LOC108352169, transcript variant X3"; transcript_biotype "lnc_RNA";

Desired output (output.txt):

LOC102552844    LOC102552844    NC_051336.1:40042-56215 NC_051336.1     40042   56215   -       16173   pseudogene
Vom2r3  Vom2r3          NC_051336.1:76909-85762 NC_051336.1     76909   85762           8853    protein_coding
LOC103693496    LOC103693496    NC_051336.1:162525-192445       NC_051336.1     162525  192445  -       29920   protein_coding
Vom2r4  Vom2r4          NC_051336.1:226465-241532       NC_051336.1     226465  241532  -       15067   protein_coding
Vom2r-ps8       Vom2r-ps8       NC_051336.1:267100-275769       NC_051336.1     267100  275769  -       8669    pseudogene
LOC102556157    LOC102556157    NC_051336.1:284195-301267       NC_051336.1     284195  301267  -       17072   protein_coding
LOC108352169    LOC108352169    NC_051336.1:307758-313908       NC_051336.1     307758  313908  -       6150    lncRNA

For only those lines of the input where $3 is "gene"...
   Column 1: "gene_id" substring of $9
   Column 2: "gene" substring of $9
   Column 3: $1 ":" $4 "-" $5
   Column 4: $1
   Column 5: $4
   Column 6: $5
   Column 7: $7
   Column 8: $5-$4
   Column 9: "gene_biotype" substring of $9

Attempts:

1.

$ cat input.txt | awk '                                                           # provide the input to awk
  BEGIN{FS="\t"}                                                                  # use tab as the field separator
  {split($9,a,";");                                                               # split field 9 into individual fields using semicolon as the field separator
  if($3~"gene")                                                                   # if the third field is "gene"
  print a[1]"\t"a[6]"\t"$1":"$4"-"$5"\t"$1"\t"$4"\t"$5"\t"$7"\t"$5-$4"\t"a[7]}' | # print the indicated fields of each line in a tab-delimited fashion
  sed 's/gene_id "//' |                                                           # remove any instance of the string gene_id "
  sed 's/gene "//' |                                                              # remove any instance of the string gene "
  sed 's/gene_biotype "//' |                                                      # remove any instance of the string gene_biotype "
  sed 's/[" ]//g' > output.txt                                                    # remove any instance of the character " globally and send output to file
$ cat output.txt
LOC102552844    LOC102552844    NC_051336.1:40042-56215 NC_051336.1     40042   56215   -       16173   pseudogene
Vom2r3  gbkeyGene       NC_051336.1:76909-85762 NC_051336.1     76909   85762           8853    Vom2r3
LOC103693496    LOC103693496    NC_051336.1:162525-192445       NC_051336.1     162525  192445  -       29920   protein_coding
Vom2r4  gbkeyGene       NC_051336.1:226465-241532       NC_051336.1     226465  241532  -       15067   Vom2r4
Vom2r-ps8       gbkeyGene       NC_051336.1:267100-275769       NC_051336.1     267100  275769  -       8669    Vom2r-ps8
LOC102556157    LOC102556157    NC_051336.1:284195-301267       NC_051336.1     284195  301267  -       17072   protein_coding
LOC108352169    LOC108352169    NC_051336.1:307758-313908       NC_051336.1     307758  313908  -       6150    lncRNA

This comes really close to what I want... except that output lines 2,4,5 are incorrect. I want the second output field to be a substring of the "gene "" input field and the ninth output field to be a substring of the "gene_biotype "" input field (both desired input fields have variable positions in the semicolon separated series of the ninth field).

While printing a[1] to extract the "gene_id "" field does work since it is ALWAYS in the first field of the ninth semicolon-delimited input field, printing a[6] and a[7] to extract the other fields mentioned above will NOT work because their index position changes for some lines (see 2,4,5 of the output). The obvious solution is to print a field only when it contains the string of interest rather than a specific index position...

2.

$ cat $ANNOTATION | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print {for(i=1;i<=NF;i  ){if($i~/gene_id "/){print a[$i]}}}"\t"{for(i=1;i<=NF;i  ){if($i~/gene "/){print a[$i]}}}"\t"$1":"$4"-"$5"\t"$1"\t"$4"\t"$5"\t"$7"\t"$5-$4"\t"{for(i=1;i<=NF;i  ){if($i~/gene_biotype "/){print a[$i]}}}}' | sed 's/gene_id "//' | sed 's/gene "//' | sed 's/gene_biotype "//' | sed 's/[" ]//g' > output.txt
$ cat output.txt

This awk command is riddled with syntax errors since using { or } or ; after print is not acceptable, thus the command is not executed and there is no output.
The general idea was to replace the index positions with awk-friendly regex code to extract fields containing specific substrings: {for(i=1;i<=NF;i ){if($i~/gene_id "/){print a[$i]}}}. However, this will not work if the syntax is improper, like above.

Note: I am using...

Shell version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
AWK version: GNU Awk 4.0.2
sed version: sed (GNU sed) 4.2.2

CodePudding user response：

Assumptions:

the last/9th field always contains entries for gene, gene_id and gene_biotype (otherwise this answer is going to print an empty string where the missing entry would show up in the output)

One awk idea:

awk '
BEGIN        { FS=OFS="\t" }
$3 == "gene" { delete attr                    # reset attr[] array
               split($NF,a,";")               # split last field on ";"
               for (i in a) {                 # loop through array
                   split(a[i],b,"\"")         # split array entry on double quote; result: b[1]==attribute name, b[2]==attribute value; we do not care about b[3]
                   gsub(/ /,"",b[1])          # strip spaces from the attribute name
                   attr[b[1]]=b[2]            # populate attr[] array
               }
               print attr["gene_id"],attr["gene"],$1":"$4"-"$5,$1,$4,$5,$7,($5-$4),attr["gene_biotype"]
             }
' input.tsv

This generates:

LOC102552844    LOC102552844    NC_051336.1:40042-56215 NC_051336.1     40042   56215   -       16173   pseudogene
Vom2r3  Vom2r3  NC_051336.1:76909-85762 NC_051336.1     76909   85762           8853    protein_coding
LOC103693496    LOC103693496    NC_051336.1:162525-192445       NC_051336.1     162525  192445  -       29920 protein_coding
Vom2r4  Vom2r4  NC_051336.1:226465-241532       NC_051336.1     226465  241532  -       15067   protein_coding
Vom2r-ps8       Vom2r-ps8       NC_051336.1:267100-275769       NC_051336.1     267100  275769  -       8669  pseudogene
LOC102556157    LOC102556157    NC_051336.1:284195-301267       NC_051336.1     284195  301267  -       17072 protein_coding
LOC108352169    LOC108352169    NC_051336.1:307758-313908       NC_051336.1     307758  313908  -       6150  lncRNA

CodePudding user response：

@Gawain : I didn't verify all the edge cases, but u can use this regex to quickly split the input into discrete components.

From there should be trivial.

  echo "${input}" | 

  {m,g}awk NF=NF           OFS='\f\r\t' ORS='\n^^^\n' \
           FS='[\t ]*[.][ \t]*|(\42[ \t]*[;][ \t\42]*|[,\42:] |[ \t][ \t] )*'
             
NC_051336
    1
    Gnomon
    gene
    40042
    56215
    -
    gene_id 
    LOC102552844
    transcript_id 
    db_xref 
    GeneID
    102552844
    db_xref 
    RGD
    7551986
    gbkey 
    Gene
    gene 
    LOC102552844
    gene_biotype 
    pseudogene
    pseudo 
    true
    
^^^
NC_051336
    1
    BestRefSeq,Gnomon
    gene
    76909
    85762
     
    gene_id 
    Vom2r3
    transcript_id 
    db_xref 
    GeneID
    502213
    db_xref 
    RGD
    1565892
    description 
    vomeronasal 2 receptor
     3
    gbkey 
    Gene
    gene 
    Vom2r3
    gene_biotype 
    protein_coding
    gene_synonym 
    RGD1565892
    
^^^
NC_051336
    1
    Gnomon
    gene
    162525
    192445
    -
    gene_id 
    LOC103693496
    transcript_id 
    db_xref 
    GeneID
    103693496
    db_xref 
    RGD
    9090555
    gbkey 
    Gene
    gene 
    LOC103693496
    gene_biotype 
    protein_coding
    
^^^
NC_051336
    1
    Gnomon
    transcript
    162525
    167758
    -
    gene_id 
    LOC103693496
    transcript_id 
    XM_039098304
    1
    db_xref 
    GeneID
    103693496
    gbkey 
    mRNA
    gene 
    LOC103693496
    model_evidence 
    Supporting evidence includes similarity to
     4 Proteins
     and 90% coverage of the annotated genomic feature by RNAseq alignments
    product 
    sperm motility kinase X-like
     transcript variant X9
    transcript_biotype 
    mRNA
    
^^^
NC_051336
    1
    BestRefSeq
    gene
    226465
    241532
    -
    gene_id 
    Vom2r4
    transcript_id 
    db_xref 
    GeneID
    308248
    db_xref 
    RGD
    1564110
    description 
    vomeronasal 2 receptor
     4
    gbkey 
    Gene
    gene 
    Vom2r4
    gene_biotype 
    protein_coding
    gene_synonym 
    RGD1564110
    
^^^
NC_051336
    1
    BestRefSeq
    transcript
    226465
    241532
    -
    gene_id 
    Vom2r4
    transcript_id 
    NM_001099458
    1
    db_xref 
    GeneID
    308248
    gbkey 
    mRNA
    gene 
    Vom2r4
    product 
    vomeronasal 2 receptor
     4
    transcript_biotype 
    mRNA
    
^^^
NC_051336
    1
    Curated Genomic gene
    267100
    275769
    -
    gene_id 
    Vom2r-ps8
    transcript_id 
    db_xref 
    GeneID
    502214
    db_xref 
    RGD
    1563053
    description 
    vomeronasal 2 receptor
     pseudogene 8
    gbkey 
    Gene
    gene 
    Vom2r-ps8
    gene_biotype 
    pseudogene
    gene_synonym 
    RGD1563053
    pseudo 
    true
    
^^^
NC_051336
    1
    Gnomon
    gene
    284195
    301267
    -
    gene_id 
    LOC102556157
    transcript_id 
    db_xref 
    GeneID
    102556157
    db_xref 
    RGD
    7626961
    gbkey 
    Gene
    gene 
    LOC102556157
    gene_biotype 
    protein_coding
    
^^^
NC_051336
    1
    Gnomon
    gene
    307758
    313908
    -
    gene_id 
    LOC108352169
    transcript_id 
    db_xref 
    GeneID
    108352169
    db_xref 
    RGD
    11507047
    gbkey 
    Gene
    gene 
    LOC108352169
    gene_biotype 
    lncRNA
    
^^^
NC_051336
    1
    Gnomon
    transcript
    307758
    313908
    -
    gene_id 
    LOC108352169
    transcript_id 
    XR_005497081
    1
    db_xref 
    GeneID
    108352169
    gbkey 
    ncRNA
    gene 
    LOC108352169
    model_evidence 
    Supporting evidence includes similarity to
     1 EST
     and 98% coverage of the annotated genomic feature by RNAseq alignments
     including 3 samples with support for all annotated introns
    product 
    uncharacterized LOC108352169
     transcript variant X3
    transcript_biotype 
    lnc_RNA
    
^^^