How can I remove numbers after a specific character with sed?

I need to modify a .fasta file that looks like this:

>Contig_1;2
AGATC...
>Contig_2;345
AaGGC...
>Contig_3;22
GGAGA...

And transform it into something like:

>Contig_1
AGATC...
>Contig_2
AaGGC...
>Contig_3
GGAGA...

I tried doing the following, but it did not work as intended.

sed -i 's/;*\n/\n/g' file.fasta

Could someone give me some advice? Thanks!

CodePudding user response：

You can use

sed -i 's/;[^;]*$//' file.fasta

Here is a quick demo:

s='>Contig_1;2
AGATC...
>Contig_2;345
AaGGC...
>Contig_3;22
GGAGA...'
sed 's/;[^;]*$//' <<< "$s"

Output:

>Contig_1
AGATC...
>Contig_2
AaGGC...
>Contig_3
GGAGA...

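Note that sed reads the input line by line and strips the trailing newline before placing each line into the pattern space, so \n in your pattern has nothing to match. (Since you are using GNU sed, you could force the whole file into the pattern space with -z, but that is not necessary here.) Also, ;* means "zero or more semicolons", not "a semicolon followed by anything".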
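To see this behaviour in practice, here is a small sketch that runs the original command and the corrected one on a shortened copy of the sample. The variable names (input, unchanged, fixed) are only illustrative.

```shell
# Shortened sample from the question; `input` is an illustrative name.
input='>Contig_1;2
AGATC...
>Contig_2;345
AaGGC...'

# The original attempt: each line reaches sed without its newline,
# so \n never matches and nothing is substituted.
unchanged=$(printf '%s\n' "$input" | sed 's/;*\n/\n/g')

# The corrected command anchors on the end of the line with $ instead.
fixed=$(printf '%s\n' "$input" | sed 's/;[^;]*$//')

[ "$unchanged" = "$input" ] && echo "original command left the text untouched"
printf '%s\n' "$fixed"
```

The first command prints its input unchanged, while the second strips the ;number suffix from the header lines.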

The ;[^;]*$ pattern matches

; - a semi-colon
[^;]* - any zero or more chars other than ; (if you need to make sure you match digits, replace with [0-9]* or [[:digit:]]*)
$ - end of the line (sed processes one line at a time, so the end of the pattern space is the end of the line).

Note you need no g flag here: the pattern is anchored to the end of the line with $, so it can match at most once per line.
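If you want to be stricter, one possible variant combines the [0-9]* suggestion with an address that restricts the substitution to header lines. This is a sketch under assumptions: the sample below is hypothetical, and it deliberately puts a ; on a sequence line to show that such lines are left alone.

```shell
# Hypothetical sample: the second line contains a ';' that must survive.
sample='>Contig_1;2
AGATC;NNN
>Contig_2;345
AaGGC...'

# /^>/ limits the substitution to lines starting with '>';
# ;[0-9]*$ removes a semicolon followed only by digits up to the end of the line.
headers_only=$(printf '%s\n' "$sample" | sed '/^>/s/;[0-9]*$//')

printf '%s\n' "$headers_only"
```

Only the two header lines change; AGATC;NNN is printed as is because it does not start with >.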

CodePudding user response：

1st solution: you can keep it very simple with awk. Written and tested with your shown samples in GNU awk.

awk -F';' '/^>/{NF=1} 1' Input_file

Explanation: -F';' sets the field separator to ; for every line. In the main program, if a line starts with >, NF (the number of fields) is set to 1, which makes awk rebuild the line from the first field only, i.e. everything before the first occurrence of ;. The trailing 1 is an always-true condition, so every line is printed.

2nd solution: if you want to handle every line that contains ; (whether or not it starts with >) and keep only the text before the first occurrence of ;, then try the following:

awk -F';' '/;/{NF=1} 1' Input_file
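
The commands above were written for GNU awk, and assigning to NF may behave differently in other awk implementations. As a possible alternative (a sketch, not part of the original answer), sub() can delete the suffix directly without touching NF. The variable names are illustrative.

```shell
# Shortened sample from the question; `sample` is an illustrative name.
sample='>Contig_1;2
AGATC...
>Contig_3;22
GGAGA...'

# On header lines, sub() deletes everything from the first ';' onward;
# the trailing 1 prints every line, changed or not.
trimmed=$(printf '%s\n' "$sample" | awk '/^>/ { sub(/;.*/, "") } 1')

printf '%s\n' "$trimmed"
```

Dropping the /^>/ condition gives the equivalent of the 2nd solution, since sub() simply does nothing on lines without a ;.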