Add column at beginning of file using awk

Time:09-17

I have a very simple one-liner that works almost perfectly. I want to add a new column to a file that says "non-coding" or "coding" depending on conditions on columns 12 and 3: if column 12 contains the substring RNA or mir-, or column 3 == "pseudogene", then column 1 should read non-coding; otherwise it should read coding.

#file

X   FlyBase pseudogene  19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821"; transcript_id "FBtr0307588"; transcript_symbol "CR32821-RB";
X   FlyBase pseudogene  476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275"; transcript_id "FBtr0070097"; transcript_symbol "CR18275-RA";
X   FlyBase pseudogene  5832298 5832368 .       .   gene_id "FBgn0052761"; gene_symbol "tRNA:Glu-CTC-6-1Psi"; transcript_id "FBtr0070818"; transcript_symbol "tRNA:Glu-CTC-6-1Psi-RA";
X   FlyBase pseudogene  6361496 6362960 .   -   .   gene_id "FBgn0016974"; gene_symbol "swaPsi"; transcript_id "FBtr0070923"; transcript_symbol "swaPsi-RA";
X   FlyBase pseudogene  6361496 6363310 .   -   .   gene_id "FBgn0016974"; gene_symbol "swaPsi"; transcript_id "FBtr0334014"; transcript_symbol "swaPsi-RB";
X   FlyBase gene    20025099    20025170    .       .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
X   FlyBase gene    1482492 1482590 .   -   .   gene_id "FBgn0044508"; gene_symbol "snoRNA:M";
X   FlyBase gene    2330159 2330826 .       .   gene_id "FBgn0053218"; gene_symbol "lncRNA:CR33218";
X   FlyBase gene    3427452 3427523 .   -   .   gene_id "FBgn0052493"; gene_symbol "tRNA:Gln-TTG-2-1";
X   FlyBase gene    3819699 3819770 .       .   gene_id "FBgn0052785"; gene_symbol "tRNA:Gln-CTG-2-1";
X   FlyBase gene    3827622 3827693 .       .   gene_id "FBgn0025118"; gene_symbol "tRNA:Pro-CGG-3-1";
2L  FlyBase gene    825969  833241  .       .   gene_id "FBgn0010583"; gene_symbol "dock";
2L  FlyBase gene    852768  854539  .       .   gene_id "FBgn0020545"; gene_symbol "kraken";
2L  FlyBase gene    855337  856639  .       .   gene_id "FBgn0031288"; gene_symbol "CG13949";
2L  FlyBase gene    860197  861806  .       .   gene_id "FBgn0031289"; gene_symbol "CG13950";
2L  FlyBase gene    877302  878270  .       .   gene_id "FBgn0002936"; gene_symbol "ninaA";

#command 

awk '{ if($12 ~ /RNA/ || $12 ~ /mir-/ || $3 == "pseudogene")  $1="non-coding"; else $1="coding"; print }'  a.gene-pseudogene_all_dmel-all-r6.40.gtf 

The code works, but it replaces column 1, which is not what I want. I want to add the new column before column 1 (so it becomes the new column 1).

How can I adjust it?

CodePudding user response:

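You can (effectively) prepend a new column by assigning $1 the value something OFS $1. This does not really create a new field: $2 still refers to the original second column, NF is unchanged, and $1 now holds both the new label and the original first column. That is not important in this case. Note that assigning to any field makes awk rebuild $0 using OFS (a single space by default), so tabs or runs of spaces between the original columns become single spaces in the output: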

awk '{
  x = ( $12~/RNA|mir-/ || $3=="pseudogene" ) ? "non-" : ""
  $1 = x "coding" OFS $1
  print
}' a.gene-pseudogene_all_dmel-all-r6.40.gtf
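
To see what the assignment does to the fields, here is a minimal sketch on two made-up rows modeled on the question's data. The rows variable, the tab separators, and the + strand on the second row are assumptions for illustration, not taken from the file. It prints NF before and after the assignment, followed by the new $1:

```shell
# Two hypothetical tab-separated rows modeled on the question's data.
# printf reuses the 9-column format for each group of 9 arguments.
rows=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  X FlyBase gene 1482492 1482590 . - . 'gene_id "FBgn0044508"; gene_symbol "snoRNA:M";' \
  2L FlyBase gene 825969 833241 . + . 'gene_id "FBgn0010583"; gene_symbol "dock";')

out=$(printf '%s\n' "$rows" | awk '{
  x = ( $12 ~ /RNA|mir-/ || $3 == "pseudogene" ) ? "non-" : ""
  n = NF                     # field count before the assignment
  $1 = x "coding" OFS $1     # label and old first column now share $1
  print n, NF, $1            # NF does not change: awk does not re-split $0
}')
printf '%s\n' "$out"
# 12 12 non-coding X
# 12 12 coding 2L
```

If later code in the same script needs the label as a real field of its own, the record would have to be re-split (for example by assigning $0 = $0 after the change); for a plain print that is not needed.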

The technique above can be used to insert before (or after) any column. When prepending before the first column (or appending after the last one), the code can be simplified and made slightly more efficient by skipping the field assignment and printing the new value alongside $0. Because no field is modified, $0 keeps its original spacing:

awk '{
  x = ( $12~/RNA|mir-/ || $3=="pseudogene" ) ? "non-" : ""
  print x "coding", $0
}' a.gene-pseudogene_all_dmel-all-r6.40.gtf
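
If the output needs to stay tab-delimited, one option is to set OFS to a tab, so that the comma in print joins the label and the untouched $0 with a tab instead of a space. A sketch on two made-up rows (the rows variable, the tab-separated input, and the + strand are assumptions for illustration):

```shell
# Two hypothetical tab-separated rows modeled on the question's data.
rows=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  X FlyBase pseudogene 476857 479309 . - . 'gene_id "FBgn0029523"; gene_symbol "CR18275";' \
  2L FlyBase gene 825969 833241 . + . 'gene_id "FBgn0010583"; gene_symbol "dock";')

# FS stays at its default (split on spaces and tabs), so $12 is still the
# gene_symbol value; only the output separator is changed to a tab.
out=$(printf '%s\n' "$rows" | awk -v OFS='\t' '{
  x = ( $12 ~ /RNA|mir-/ || $3 == "pseudogene" ) ? "non-" : ""
  print x "coding", $0       # $0 is printed as read, tabs intact
}')
printf '%s\n' "$out"
```

Each output line then has ten tab-separated columns: the new label followed by the nine original ones.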