Remove extra characters using awk command

I want to generate a new file from this file, but without the gene version in the second column, i.e. without the . and any number that comes after it. This is the file content:

    chr gene_id           gene_name start end   gene_type
    1   ENSG00000223972.4 DDX11L1   11869 14412 pseudogene
    1   ENSG00000227232.3 WASH7P    14363 29806 pseudogene

The output should look like:

    chr gene_id         gene_name start end   gene_type
    1   ENSG00000223972 DDX11L1   11869 14412 pseudogene
    1   ENSG00000227232 WASH7P    14363 29806 pseudogene

I tried this command: sed $2 's/ *..*//' gene_annot.parsed.txt > gene1.txt

CodePudding user response：

In the simplest of possible variants:

$ awk '{gsub(/\.[0-9]+ /, " ", $0)}1' genes
chr gene_id         gene_name  start end   gene_type
1   ENSG00000223972 DDX11L1    11869 14412 pseudogene
1   ENSG00000227232 WASH7P     14363 29806 pseudogene
1   ENSG00000243485 MIR1302-11 29554 31109 antisense
1   ENSG00000221311 MIR1302-11 30366 30503 miRNA
1   ENSG00000237613 FAM138A    34554 36081 protein_coding
1   ENSG00000240361 OR4G11P    62948 63887 pseudogene
1   ENSG00000186092 OR4F5      69091 70008 protein_coding

Note that this works on the whole line: if other fields further down in the file also contain a . followed by digits and a space, those will be stripped too, which may be undesirable.
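To make that caveat concrete, here is a small sketch. The data row is made up for illustration (a gene_name that happens to end in a dot and a digit), and the field-restricted variant is just one possible way to avoid the problem, not the only one.

```shell
# Made-up sample: the gene_name in the data row ends in ".1".
sample='chr gene_id gene_name start end gene_type
1 ENSG00000223972.4 RP11-34P13.1 11869 14412 pseudogene'

# Whole-line gsub: strips the version from gene_id AND the ".1" from gene_name.
whole_line=$(printf '%s\n' "$sample" | awk '{gsub(/\.[0-9]+ /, " ", $0)}1')

# Field-restricted sub: only a trailing ".digits" in the second column is removed.
field_only=$(printf '%s\n' "$sample" | awk '{sub(/\.[0-9]+$/, "", $2)}1')

printf '%s\n' "$whole_line" "$field_only"
```

With the whole-line version the gene name comes out as RP11-34P13, while the field-restricted version leaves it as RP11-34P13.1.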

CodePudding user response：

Assuming that . is never present before the 2nd column, you might use GNU sed for this as follows. Let the content of file.txt be

    chr gene_id           gene_name start end   gene_type
    1   ENSG00000223972.4 DDX11L1   11869 14412 pseudogene
    1   ENSG00000227232.3 WASH7P    14363 29806 pseudogene

then

sed 's/\.[0-9]*//' file.txt

output

    chr gene_id           gene_name start end   gene_type
    1   ENSG00000223972 DDX11L1   11869 14412 pseudogene
    1   ENSG00000227232 WASH7P    14363 29806 pseudogene

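Explanation: for each line, replace a literal . (note that \ is required, as an unescaped . matches any character in a sed regular expression) followed by zero or more (*) digits ([0-9]) with the empty string (i.e. remove it). Without the g flag this happens only once per line, for the first match.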

If you need to use GNU AWK AT ANY PRICE, then to get the same effect do

awk '{sub(/\.[0-9]*/,"");print}' file.txt
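If the assumption about the first column does not hold, one possible workaround is to anchor the match on the ID prefix. The sketch below assumes every gene_id starts with ENSG, as in the sample, and uses a made-up line with a dot in column 1 to show the difference. It relies on sed -E for extended regular expressions.

```shell
# Made-up line whose first column already contains a dot.
line='chr1.2 ENSG00000223972.4 DDX11L1 11869 14412 pseudogene'

# Unanchored: removes the first ".digits" on the line, which sits in column 1 here.
unanchored=$(printf '%s\n' "$line" | sed 's/\.[0-9]*//')

# Anchored on the ENSG prefix (assumption: all gene IDs start with "ENSG").
anchored=$(printf '%s\n' "$line" | sed -E 's/(ENSG[0-9]+)\.[0-9]+/\1/')

printf '%s\n' "$unanchored" "$anchored"
```

The unanchored command damages the first column and leaves the version in place, while the anchored one only touches the gene ID.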

CodePudding user response：

With awk it could be:

awk 'NR > 1 && index($2,".") {sub(/\.[[:digit:]]*/,"",$2)} 1' file
chr gene_id            gene_name start end   gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene

Double condition: skip the header, i.e. NR > 1, and make sure field 2 contains a dot character, i.e. index($2,".").
If both are true, the action removes the dot and the digit(s) after it from field 2. Finally, the 1 prints the record.
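The same one-liner can be spelled out over several lines with comments, which may make the condition/action structure easier to follow. This is only a sketch on the sample data from the question, written with [0-9] in place of [[:digit:]], which matches the same characters for this input.

```shell
sample='chr gene_id            gene_name start end   gene_type
1  ENSG00000223972.4 DDX11L1    11869 14412 pseudogene
1  ENSG00000227232.3 WASH7P    14363 29806 pseudogene'

result=$(printf '%s\n' "$sample" | awk '
  # Condition: not the header line, and field 2 contains a dot.
  NR > 1 && index($2, ".") {
    # Action: delete the dot and the digits after it, in field 2 only.
    sub(/\.[0-9]*/, "", $2)
  }
  # A bare true pattern with no action prints the current record.
  1
')
printf '%s\n' "$result"
```

Changing $2 makes awk rebuild the record with single spaces between fields, which is why the data rows lose their alignment while the untouched header keeps its original spacing.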

CodePudding user response：

$ awk '{sub(/\..*/,"",$2)} 1' file
chr gene_id            gene_name start end   gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene

or if you prefer visual alignment:

$ awk '{sub(/\..*/,"",$2)} 1' file | column -t
chr  gene_id          gene_name  start  end    gene_type
1    ENSG00000223972  DDX11L1    11869  14412  pseudogene
1    ENSG00000227232  WASH7P     14363  29806  pseudogene