Append modified ID to fasta file ID

I have a file that looks like this:

>1_CCACT_1/1
CCATCATTGGCGTCTACA
>2_ATATC_1/1
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1
ATATGAAGGCTGTGAAGCAAAGCGTC

And I want to make it look like this:

>1_CCACT_1/1#1#CCACT
CCATCATTGGCGTCTACA
>2_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1#1#GCTAT
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1#1#GTATG
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC

Here the 1 after the slash is original. It is followed by a #, then by the number that comes right after the barcode (the trailing 1 in):

5_ATATC_1

That is followed by another #, and then by the barcode itself (ATATC in):

5_ATATC_1

I'm using the last entry just as an example. I have some messy sed scripts that can produce the desired header (sort of) but I can't figure out how to append them back to the original headers. You can't assume that the second number will always be a 1, but you can assume that the order of the file won't change. Open to solutions in any programming language, though I've only tried in bash.

CodePudding user response:

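A couple of sed ideas using capture groups (\2 is the barcode, \3 is the number after it, and & in the first command stands for the whole matched header):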

sed -E 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/&#\3#\2/'           fasta.dat
sed -E 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/>\1_\2_\3\4#\3#\2/' fasta.dat

Both of these generate:

>1_CCACT_1/1#1#CCACT
CCATCATTGGCGTCTACA
>2_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1#1#GCTAT
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1#1#GTATG
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
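
To check what each capture group picks up before trusting the substitution, one option is to print the groups instead of rewriting the header. This is only a debugging sketch: the function name show_groups is made up for illustration, and the regex is the one from the commands above with ^ and $ anchors added.

```shell
# Hypothetical helper: print the four capture groups for each header line.
# -n suppresses normal output; the trailing p prints only lines where the
# substitution matched, so sequence lines are skipped.
show_groups() {
  sed -n -E 's/^>([^_]*)_([^_]*)_([0-9]*)(.*)$/1=[\1] 2=[\2] 3=[\3] 4=[\4]/p' "$@"
}

printf '>5_ATATC_1/1\nATATGAAGGCTGTGAAGCAAAGCGTC\n' | show_groups
# prints: 1=[5] 2=[ATATC] 3=[1] 4=[/1]
```

Group 2 is the barcode and group 3 is the number after it, which is why the replacement ends in #\3#\2.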

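Once satisfied with the result, add the -i flag to overwrite the input file (-i.bak keeps the original as fasta.dat.bak). Run only one of the following, since running both would append the suffix twice: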

sed -E -i.bak 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/&#\3#\2/'           fasta.dat
sed -E -i.bak 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/>\1_\2_\3\4#\3#\2/' fasta.dat

$ cat fasta.dat
>1_CCACT_1/1#1#CCACT
CCATCATTGGCGTCTACA
>2_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1#1#GCTAT
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1#1#GTATG
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
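
Because the substitution also matches a header that has already been tagged, running it a second time appends the suffix again. A possible guard, sketched below, is to restrict the substitution to header lines that do not yet contain a #. This assumes the original headers never contain a # themselves, and the function name tag_headers is made up for illustration.

```shell
# Hypothetical helper: tag only headers that have no '#' yet, so a second
# run leaves the output unchanged.
# Assumption: untouched headers never contain '#'.
tag_headers() {
  sed -E '/^>[^#]*$/ s/^>([^_]*)_([^_]*)_([0-9]+)(.*)$/&#\3#\2/' "$@"
}

once=$(printf '>5_ATATC_1/1\nATATGAAGGCTGTGAAGCAAAGCGTC\n' | tag_headers)
twice=$(printf '%s\n' "$once" | tag_headers)
printf '%s\n' "$twice"
# prints:
# >5_ATATC_1/1#1#ATATC
# ATATGAAGGCTGTGAAGCAAAGCGTC
```

The same sed script should also work with -i.bak on fasta.dat once the output looks right.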