I want to generate a new file from this file, but with the gene version removed from the second column, i.e. the dot and any number coming after it. This is the file content:
chr gene_id gene_name start end gene_type
1 ENSG00000223972.4 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232.3 WASH7P 14363 29806 pseudogene
The output should look like:
chr gene_id gene_name start end gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene
I tried this command: sed $2 's/ *..*//' gene_annot.parsed.txt > gene1.txt
CodePudding user response:
In the simplest of possible variants:
awk '{gsub(/\.[0-9]+ /, " ", $0)}1' genes
chr gene_id gene_name start end gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene
1 ENSG00000243485 MIR1302-11 29554 31109 antisense
1 ENSG00000221311 MIR1302-11 30366 30503 miRNA
1 ENSG00000237613 FAM138A 34554 36081 protein_coding
1 ENSG00000240361 OR4G11P 62948 63887 pseudogene
1 ENSG00000186092 OR4F5 69091 70008 protein_coding
Should there be other values with a . in other fields (further down in the file), this may have undesirable results, because the substitution is applied to the whole line rather than to the second column only.
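One way around that side effect is to anchor the match to the second field. The following is only a sketch, assuming whitespace-separated fields with no leading whitespace; the function name strip_gene_version and the extra 1.5 field are made up for illustration.

```shell
# Hypothetical helper: strip the version suffix from field 2 only.
# Assumes whitespace-separated fields and no leading whitespace.
strip_gene_version() {
  # \1 keeps field 1, the separator, and the part of field 2 before the dot
  sed -E 's/^([^[:space:]]+[[:space:]]+[^[:space:].]+)\.[0-9]+/\1/' "$@"
}

# A dot in a later field (here an invented value, 1.5) is left untouched
printf '1 ENSG00000223972.4 DDX11L1 1.5\n' | strip_gene_version
# prints: 1 ENSG00000223972 DDX11L1 1.5
```

Called with a file name, e.g. strip_gene_version genes > gene1.txt, it would write the result to a new file; the header line passes through unchanged because its second field contains no dot.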
CodePudding user response:
Assuming that . is never present before the 2nd column, you might use GNU sed for this as follows. Let file.txt content be
chr gene_id gene_name start end gene_type
1 ENSG00000223972.4 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232.3 WASH7P 14363 29806 pseudogene
then
sed 's/\.[0-9]*//' file.txt
output
chr gene_id gene_name start end gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene
Explanation: for each line, replace a literal . (note that \ is required, as . has a special meaning in sed regular expressions) followed by zero or more (*) digits ([0-9]) with the empty string (i.e. remove it). Without the g flag this happens only once per line, at the first match.
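To illustrate the "once" part, here is a small sketch comparing the command with and without the g flag. The trailing 0.75 field is invented for the demonstration and is not part of the original data.

```shell
# Sample line with an invented extra field (0.75) that also contains a dot
line='1 ENSG00000223972.4 DDX11L1 0.75'

# Without g: only the first match on the line is removed
printf '%s\n' "$line" | sed 's/\.[0-9]*//'
# prints: 1 ENSG00000223972 DDX11L1 0.75

# With g: every match is removed, which also damages the last field
printf '%s\n' "$line" | sed 's/\.[0-9]*//g'
# prints: 1 ENSG00000223972 DDX11L1 0
```

This is why the command is safe as long as the first dot on each line belongs to the gene ID.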
If you need to use GNU AWK at any price, then to get the same effect do
awk '{sub(/\.[0-9]*/,"");print}' file.txt
CodePudding user response:
With awk it could be:
awk 'NR > 1 && index($2,".") {sub(/\.[[:digit:]]*/,"",$2)} 1' file
chr gene_id gene_name start end gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene
- Double condition: skip the header, i.e. NR > 1, and make sure field 2 contains a dot character, i.e. index($2,".").
- If true, then the action: remove the dot and the digit(s) after it from field 2. Finally, the trailing 1 prints the line.
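Note that assigning to $2 makes awk rebuild the line using its output field separator, a single space by default. If the real file is tab-delimited (an assumption; the question does not state the delimiter), a variant along these lines would keep the tabs. The function name strip_version_tsv is hypothetical.

```shell
# Hypothetical helper for a tab-delimited file (an assumption; the question
# does not state the delimiter).
strip_version_tsv() {
  # -F sets the input separator; OFS is used when awk rebuilds the
  # line after $2 has been modified
  awk -F '\t' -v OFS='\t' 'NR > 1 { sub(/\.[0-9]+$/, "", $2) } 1' "$@"
}

# Two-line tab-separated sample: header plus one record
printf 'chr\tgene_id\n1\tENSG00000223972.4\n' | strip_version_tsv
```

The header is skipped by NR > 1, and the $ anchor limits the removal to a version suffix at the end of field 2.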
CodePudding user response:
$ awk '{sub(/\..*/,"",$2)} 1' file
chr gene_id gene_name start end gene_type
1 ENSG00000223972 DDX11L1 11869 14412 pseudogene
1 ENSG00000227232 WASH7P 14363 29806 pseudogene
or if you prefer visual alignment:
$ awk '{sub(/\..*/,"",$2)} 1' file | column -t
chr  gene_id          gene_name  start  end    gene_type
1    ENSG00000223972  DDX11L1    11869  14412  pseudogene
1    ENSG00000227232  WASH7P     14363  29806  pseudogene