I have a sequence file that has a repeated pattern that looks like this:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
and so on. I want to extract each record (the >g## header line and the sequence that follows it) and write it to a new file named protein_g##.faa. In the above example, the first record would go into a file called "protein_g34.faa", which would contain:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
I was trying to use sed but I am not very experienced using it. My guess was something like this:
$ sed -n '/^>g*/s///p; y/ /\n/' file > "g##"
but I can clearly tell that that is wrong... maybe the right thing is using awk? Thanks!
CodePudding user response:
Yeah, I would use awk for that. sed can write to files with its w command, but the file name has to be spelled out in the script, so it can't easily pick a different output file for each record.
Here's how I would write that:
< input.txt awk '/^\$>/{fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} {print $0 > fname}'
Breaking it down into details:
- < input.txt
  This part reads in the input file.
- awk
  Runs awk.
- /^\$>/
  On lines which start with the literal string $>, run the piece of code in braces.
- (If previous step matched)
  {fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname}
  Take the first field of the matching line. Remove the first two characters of that field. Surround that with protein_ and .faa. Save it as the variable fname. Print a message about switching files.
- This next block has no condition before it. Implicitly, that means that it matches every line.
  {print $0 > fname}
  Take the entire line, and send it to the file named by fname. If no file has been selected yet, this will cause an error.
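A slightly more defensive variant of the same idea, as a sketch rather than a drop-in replacement. It assumes the input is saved as input.txt (recreated here with shortened sequences so the example is self-contained). It skips any lines that appear before the first header instead of failing, and it closes each output file before opening the next one, so a large input does not run into the open-file limit.

```shell
# Recreate a small sample input (sequences shortened for illustration).
cat > input.txt <<'EOF'
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTT
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGG
EOF

awk '
  /^\$>/ {                       # header line: switch output files
    if (fname != "") close(fname)
    fname = "protein_" substr($1, 3) ".faa"
  }
  fname != "" { print > fname }  # ignore anything before the first header
' input.txt
```

One caveat of closing files this way: if the same ID shows up twice in the input, the second record overwrites the first.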
Hope that helps!
CodePudding user response:
If awk is an option:
awk '/\|/ {split($1,a,">"); fname="protein_"a[2]".faa"} {print $0 >> fname}' src.dat
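To see what the split call extracts, it can be run on a single header line. This is only an illustration; the header is copied from the question.

```shell
# $1 is "$>g34"; splitting it on ">" gives a[1]="$" and a[2]="g34".
printf '%s\n' '$>g34 | effector probability: 0.6' |
  awk '{ split($1, a, ">"); print "protein_" a[2] ".faa" }'
```

Because the command above redirects with >>, running it a second time appends to the files left over from the first run, so remove old protein_*.faa files before rerunning.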
CodePudding user response:
awk is better than sed for this problem. You can implement it in GNU sed (the -z option and the e flag are GNU extensions) with
sed -rz 's/(\$>)(g[^ ]*)([^\n]*\n[^\n]*)\n/echo '\''\1\2\3'\'' > protein_\2.faa;/ge' file
This solution is nice for showing some sed tricks:
- -z for parsing fragments that span several lines
- (..) for remembering strings
- \$ matching a literal $
- [^\n]* matching until end of line
- '\'' for a single quote: end the single-quoted string, escape a single quote, and start a new single-quoted string
- \2 for recalling the second remembered string
- Write a bash command in the replacement string
- e executes the result of the replacement
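Two of these tricks are easy to try in isolation. The snippet below is a small illustration with made-up strings, assuming GNU sed for the -r option as in the answer.

```shell
# 'it' ends the first quoted string, \' adds a literal single quote,
# and 's' starts a new quoted string; the shell joins the three pieces.
printf '%s\n' 'it'\''s'

# (a) and (b) are remembered as \1 and \2, so they can be swapped.
printf '%s\n' 'abc' | sed -r 's/(a)(b)/\2\1/'
```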