For everything in a column, remove text between two periods

Time:07-13

I'm trying to reformat column 4 of the file "orthologsClassification.tsv" so that it keeps everything except the text between the two periods. I want:

t_gene          t_transcript           q_gene   q_transcript 
ENSG00000213096 ENST00000616028.ZNF254 reg_2133 ENST00000616028.ZNF254.2177
ENSG00000213096 ENST00000616028.ZNF254 reg_2053 ENST00000616028.ZNF254.2637

to become:

t_gene          t_transcript           q_gene   q_transcript  
ENSG00000213096 ENST00000616028.ZNF254 reg_2133 ENST00000616028.2177
ENSG00000213096 ENST00000616028.ZNF254 reg_2053 ENST00000616028.2637

I've already tried:

awk '{sub(/\..*$/, "", $4)} 1' OFS='\t' orthologsClassification.tsv

but this deletes everything from the first period onward in column 4 (so ENST00000616028.ZNF254.2177 becomes ENST00000616028, when I really want ENST00000616028.2177).

Any ideas? Thank you!

CodePudding user response:

$ awk -F. 'BEGIN{OFS=FS}NR>1{$--NF=$NF}1' file
t_gene          t_transcript           q_gene   q_transcript 
ENSG00000213096 ENST00000616028.ZNF254 reg_2133 ENST00000616028.2177
ENSG00000213096 ENST00000616028.ZNF254 reg_2053 ENST00000616028.2637

CodePudding user response:

One awk idea using a regex:

$ awk 'BEGIN{FS=OFS="\t"} FNR>1 {sub(/\.[^.]*\./,".",$4)} 1' orthologsClassification.tsv
t_gene  t_transcript    q_gene  q_transcript
ENSG00000213096 ENST00000616028.ZNF254  reg_2133        ENST00000616028.2177
ENSG00000213096 ENST00000616028.ZNF254  reg_2053        ENST00000616028.2637
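
To see why this pattern succeeds where the attempt in the question did not, the two regexes can be compared on a single value. This is only a minimal sketch; the variable names `value`, `greedy` and `bounded` are made up for the illustration.

```shell
value='ENST00000616028.ZNF254.2177'

# Attempt from the question: ".*" is greedy and runs to the end of the string,
# so everything from the first period onward is removed.
greedy=$(printf '%s\n' "$value" | awk '{ sub(/\..*$/, ""); print }')

# Negated class: "[^.]*" cannot cross a period, so only ".ZNF254." is matched
# and it is replaced by a single period.
bounded=$(printf '%s\n' "$value" | awk '{ sub(/\.[^.]*\./, "."); print }')

printf '%s\n%s\n' "$greedy" "$bounded"
```

The first line printed is the truncated ID described in the question, the second is the wanted form.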

Another awk idea using split on $4 to extract the 1st and 3rd period-delimited subfields:

$ awk 'BEGIN{FS=OFS="\t"} FNR>1 {split($4,a,".");$4=a[1]"."a[3]} 1' orthologsClassification.tsv
t_gene  t_transcript    q_gene  q_transcript
ENSG00000213096 ENST00000616028.ZNF254  reg_2133        ENST00000616028.2177
ENSG00000213096 ENST00000616028.ZNF254  reg_2053        ENST00000616028.2637
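
If some rows might carry a different number of period-delimited parts in column 4, the split idea can be loosened to keep the first and last parts and leave shorter values alone. This is a sketch under that assumption, not something the question asked for; the function name `strip_middle` and the sample rows are hypothetical.

```shell
# strip_middle: keep only the first and last period-delimited parts of column 4.
# Rows whose column 4 has fewer than three parts are passed through unchanged.
strip_middle() {
    awk 'BEGIN { FS = OFS = "\t" }
         FNR > 1 {
             n = split($4, a, ".")          # n = number of period-delimited parts
             if (n > 2) $4 = a[1] "." a[n]  # drop everything between first and last
         }
         1' "$@"
}

# Hypothetical sample: the last row already has only two parts in column 4.
printf 'h1\th2\th3\th4\nG1\tT1.ZNF254\treg_1\tT1.ZNF254.2177\nG1\tT1.ZNF254\treg_2\tT1.2637\n' |
    strip_middle
```

Called with a file name instead of a pipe (for example `strip_middle orthologsClassification.tsv`), the function reads that file, since the arguments are passed straight to awk.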

CodePudding user response:

Given the example you provided, all you need is the following (it relies on GNU sed treating \t inside a bracket expression as a tab):

$ sed 's/\.[^\t]*\././' file
t_gene  t_transcript    q_gene  q_transcript
ENSG00000213096 ENST00000616028.ZNF254  reg_2133        ENST00000616028.2177
ENSG00000213096 ENST00000616028.ZNF254  reg_2053        ENST00000616028.2637
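
For a sed that does not understand `\t` inside brackets, one possible variant is to use the POSIX class `[:space:]` and anchor the match at the end of the line. This is a sketch that assumes column 4 is the last column, holds exactly three period-delimited parts, and has no trailing whitespace; the function name `strip_middle_last_col` is made up here.

```shell
# strip_middle_last_col: drop the middle part of a three-part, period-delimited
# value at the end of each line. Assumes column 4 is the last column.
strip_middle_last_col() {
    sed -E 's/\.[^.[:space:]]*\.([^.[:space:]]*)$/.\1/' "$@"
}

# Hypothetical one-row sample; column 2 also contains a period and must survive.
printf 'G1\tT1.ZNF254\treg_1\tT1.ZNF254.2177\n' | strip_middle_last_col
```

Because neither bracket expression can cross whitespace or a period, the period in column 2 is never part of a match.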

CodePudding user response:

awk -F'\\.[^ \t]*\\.' NF=NF OFS=. orthologsClassification.tsv