I have 48411 FASTA sequences, each 1555 characters in length (in a single file, 78.3 MB total) with headers as follows:
CYTC2889-12 HM036578 Homo sapiens
but unfortunately, spaces have been used to delimit text instead of the usual '|' (I think).
I'd like to add '|' to headers so that they become
CYTC2889-12|HM036578|Homo sapiens
I just need to replace the first two spaces. There should not be a pipe in the species name. Thus, the end result should be "Homo sapiens" not "Homo | sapiens".
I'm unsure how to proceed since all spaces would be replaced by a pipe instead of just within the first two identifiers (CYTC2889-12 and HM036578 in the above example) as pointed out by @CharlesDuffy.
It seems like a simple task (?), but I'm getting thrown off by the use of spaces as delimiters (or so I assume this is how spaces are being used).
Any thoughts?
CodePudding user response:
OP hasn't provided sample data showing header and non-header lines, so based on other FASTA-related questions I'm going to guess the only lines with spaces are header lines.
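That guess can be checked before changing anything. A minimal sketch, assuming FASTA headers start with '>' (the function name count_spaced_sequence_lines is made up for illustration):

```shell
# Hypothetical helper: print how many non-header lines contain a space.
count_spaced_sequence_lines() {
    # Skip lines starting with '>' (headers) and count the remaining lines
    # that contain at least one space; n+0 prints 0 when none are found.
    awk '!/^>/ && / / { n++ } END { print n + 0 }' "$1"
}
```

If `count_spaced_sequence_lines yourfile.fasta` prints 0, only header lines contain spaces and the assumption holds for that file.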
Setup:
$ cat bogus.fasta
>CYTC2889-12 HM036578 Homo sapiens
CCATCATTGGCGTCTACA
>CYTC2889-12 HM036578 Homo sapiens
CCATCATTGGCGTCTACA
>CYTC2889-12 HM036578 Homo sapiens
CCATCATTGGCGTCTACA
>CYTC2889-12 HM036578 Homo sapiens
CCATCATTGGCGTCTACA
One sed idea to replace the 1st two spaces with a pipe (|):
$ sed 's/ /|/1;s/ /|/1' bogus.fasta
>CYTC2889-12|HM036578|Homo sapiens
CCATCATTGGCGTCTACA
>CYTC2889-12|HM036578|Homo sapiens
CCATCATTGGCGTCTACA
>CYTC2889-12|HM036578|Homo sapiens
CCATCATTGGCGTCTACA
>CYTC2889-12|HM036578|Homo sapiens
CCATCATTGGCGTCTACA
Where:
- the 1st s/ /|/1 says to replace the 1st space we find with a pipe; this modified line now becomes the input for the 2nd half of the script, where ...
- the 2nd s/ /|/1 also says to replace the 1st space we find with a pipe (which in this case is actually the 2nd space from the original line)
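If sequence lines might also contain spaces, the same two substitutions can be limited to header lines with an address. A rough sketch, assuming headers start with '>' (the function name pipe_fasta_headers is hypothetical):

```shell
# Hypothetical helper: replace the first two spaces with '|' on header lines only.
pipe_fasta_headers() {
    # /^>/ selects lines that start with '>'; the braces group both
    # substitutions so every other line is printed unchanged.
    sed '/^>/{s/ /|/;s/ /|/;}' "$1"
}
```

Running `pipe_fasta_headers bogus.fasta` writes the result to stdout and leaves the input file untouched.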
If the results look correct and OP wants to modify the original file then, assuming GNU sed, the -i flag can be added to update the input file in place, e.g.:
sed -i 's/ /|/1;s/ /|/1' bogus.fasta
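Since -i overwrites the file, it may be worth keeping a copy of the original. A small sketch, again assuming GNU sed, where a suffix attached to -i makes sed save a backup (the function name pipe_fasta_headers_inplace is made up for illustration):

```shell
# Hypothetical helper: edit the file in place but keep the original as <file>.bak
pipe_fasta_headers_inplace() {
    # -i.bak writes the changes back to "$1" and saves the untouched
    # original next to it as "$1.bak".
    sed -i.bak 's/ /|/1;s/ /|/1' "$1"
}
```

After `pipe_fasta_headers_inplace bogus.fasta`, a quick `diff bogus.fasta.bak bogus.fasta` shows exactly which lines changed, and the .bak file can be deleted once the result looks right.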