I have a file that looks like this:
>1_CCACT_1/1
CCATCATTGGCGTCTACA
>2_ATATC_1/1
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1
ATATGAAGGCTGTGAAGCAAAGCGTC
And I want to make it look like this:
>1_CCACT_1/1#1#CCACT
CCATCATTGGCGTCTACA
>2_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1#1#GCTAT
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1#1#GTATG
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
Here everything up to and including the /1 is the original header. It is followed by a #, then the second number from the header (the 1 after the barcode in 5_ATATC_1), then another #, and finally the barcode (the ATATC in 5_ATATC_1).
I'm using the last entry just as an example. I have some messy sed scripts that can (sort of) produce the desired suffix, but I can't figure out how to append it to the original headers. You can't assume that the second number will always be a 1, but you can assume that the order of the file won't change. I'm open to solutions in any programming language, though I've only tried bash.
CodePudding user response:
A couple of sed ideas using capture groups:
sed -E 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/&#\3#\2/' fasta.dat
sed -E 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/>\1_\2_\3\4#\3#\2/' fasta.dat
Both of these generate:
>1_CCACT_1/1#1#CCACT
CCATCATTGGCGTCTACA
>2_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1#1#GCTAT
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1#1#GTATG
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
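If there is any chance that a non-header line contains a > or underscores, a slightly stricter variant may be safer. The following is only a sketch, assuming every header has the form >ID_BARCODE_NUM followed by anything; the function name add_suffix and the two-record sample are hypothetical (the second header is altered to end in _2 to show that the number is not hard-coded). Note that the capture groups are numbered differently here: \1 is the barcode and \2 is the second number, and at least one digit is required.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append "#<second number>#<barcode>" to FASTA headers.
# The /^>/ address limits the substitution to header lines, and the
# pattern is anchored so that & (the whole match) is the whole line.
add_suffix() {
  sed -E '/^>/ s/^>[^_]*_([^_]*)_([0-9]+).*$/&#\2#\1/' "$@"
}

# Small made-up sample in the same layout as the question's file.
sample=$(mktemp)
printf '%s\n' '>1_CCACT_1/1' 'CCATCATTGGCGTCTACA' \
              '>5_ATATC_2/1' 'ATATGAAGG' > "$sample"

result=$(add_suffix "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Sequence lines pass through untouched because they never match the /^>/ address.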
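Once satisfied with the result, add the -i flag to overwrite the input file. Run only one of the two commands below, since running both would append the suffix twice; -i.bak keeps a copy of the original as fasta.dat.bak: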
sed -E -i.bak 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/&#\3#\2/' fasta.dat
sed -E -i.bak 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/>\1_\2_\3\4#\3#\2/' fasta.dat
$ cat fasta.dat
>1_CCACT_1/1#1#CCACT
CCATCATTGGCGTCTACA
>2_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
>3_GCTAT_1/1#1#GCTAT
CAAACCCATTAATTTCACATCCGTCC
>4_GTATG_1/1#1#GTATG
TAAGCCAGGTTGGTTTCTATCTTT
>5_ATATC_1/1#1#ATATC
ATATGAAGGCTGTGAAGCAAAGCGTC
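After an in-place edit it can be worth checking the result against the .bak copy. The sketch below is one possible sanity check, not part of the answer above: it assumes barcodes contain no # or _ characters, works in a throwaway directory on a made-up two-record sample, and uses hypothetical variable names (headers, tagged, seq_ok).

```shell
#!/usr/bin/env bash
# Throwaway directory with a small sample, so the real file is not touched.
workdir=$(mktemp -d)
printf '%s\n' '>1_CCACT_1/1' 'CCATCATTGGCGTCTACA' \
              '>2_ATATC_1/1' 'ATATGAAGGCTGTGAAGCAAAGCGTC' > "$workdir/fasta.dat"

# Same in-place edit as above; the original is kept as fasta.dat.bak.
sed -E -i.bak 's/>([^_]*)_([^_]*)_([0-9]*)(.*)/&#\3#\2/' "$workdir/fasta.dat"

# Number of headers before the edit, and headers carrying the new suffix after it.
headers=$(grep -c '^>' "$workdir/fasta.dat.bak")
tagged=$(grep -cE '^>.*#[0-9]+#[^#_]+$' "$workdir/fasta.dat")

# Sequence lines should be byte-for-byte identical before and after.
grep -v '^>' "$workdir/fasta.dat.bak" > "$workdir/seq.before"
grep -v '^>' "$workdir/fasta.dat"     > "$workdir/seq.after"
if cmp -s "$workdir/seq.before" "$workdir/seq.after"; then seq_ok=yes; else seq_ok=no; fi

echo "headers=$headers tagged=$tagged sequences_unchanged=$seq_ok"
rm -rf "$workdir"
```

If headers and tagged differ, some header did not match the expected >ID_BARCODE_NUM layout and was left unchanged.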