extracting text from one file and creating new file with that text using linux/bash-CodePudding

I have a sequence file that has a repeated pattern that looks like this:

$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS"

$>g104 | effector probability: 0.65 
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH

$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS

and so on. I want to extract the text between and including each >g## and create a new file titled protein_g##.faa In the above example it would create a file called "protein_g34.faa" and it would be:

$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS

I was trying to use sed but I am not very experienced using it. My guess was something like this:

$ sed -n '/^>g*/s///p; y/ /\n/' file > "g##"

but I can clearly tell that that is wrong... maybe the right thing is using awk? Thanks!

CodePudding user response：

Yeah, I would use awk for that. I don't think sed can write to more than one different output stream.

Here's how I would write that:

< input.txt awk '/^\$>/{fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} {print $0 > fname}'

Breaking it down into details:

< input.txt This part reads in the input file.
awk Runs awk.
/^\$>/ On lines which start with the literal string $>, run the piece of code in brackets.
(If previous step matched) {fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} Take the first field in the previous line. Remove the first two characters of that field. Surround that with protein_ .faa. Save it as the variable fname. Print a message about switching files.
This next block has no condition before it. Implicitly, that means that it matches every line.
{print $0 > fname} Take the entire line, and send it to the filename held by fname. If no file is selected, this will cause an error.

Hope that helps!

CodePudding user response：

If awk is an option:

awk '/\|/ {split($1,a,">"); fname="protein_"a[2]".faa"} {print $0 >> fname}' src.dat

CodePudding user response：

awk is better than sed for this problem. You can implement it in sed with

sed -rz 's/(\$>)(g[^ ]*)([^\n]*\n[^\n]*)\n/echo '\''\1\2\3'\'' > protein_\2.faa/ge' file

This solution is nice for showing some sed tricks:

-z for parsing fragments that span several lines
(..) for remembering strings
\$ matching a literal $
[^\n]* matching until end of line
'\'' for a single quote
End single quoted string, escape single quote and start new single quoted string
\2 for recalling the second remembered string
Write a bash command in the replacement string
e execute result of replacement