extracting text from one file and creating new file with that text using linux/bash

Time:12-15

I have a sequence file that has a repeated pattern that looks like this:

$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS

$>g104 | effector probability: 0.65 
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH

$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS

and so on. I want to extract each record, from its >g## header up to the next one, and write it to a new file named protein_g##.faa. In the above example it would create a file called "protein_g34.faa" containing:

$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS

I was trying to use sed but I am not very experienced using it. My guess was something like this:

$ sed -n '/^>g*/s///p; y/ /\n/' file > "g##"

but I can clearly tell that that is wrong... maybe the right thing is using awk? Thanks!

CodePudding user response:

Yeah, I would use awk for that. sed's w command can only write to a file name that is fixed in the script, so it can't easily pick a different output file for each record.

Here's how I would write that:

< input.txt awk '/^\$>/{fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} {print $0 > fname}'

Breaking it down into details:

  1. < input.txt This part reads in the input file.
  2. awk Runs awk.
  3. /^\$>/ On lines which start with the literal string $>, run the piece of code in braces.
  4. (If previous step matched) {fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} Take the first field of the current line. Remove the first two characters of that field (the $>). Prefix the result with protein_ and append .faa. Save it as the variable fname. Print a message about switching files.
  5. This next block has no condition before it. Implicitly, that means that it matches every line.
  6. {print $0 > fname} Take the entire line, and send it to the filename held by fname. If no header line has been seen yet, fname is still empty and awk will report an error.
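
A slightly more defensive variant of the same idea, as a sketch rather than a drop-in replacement. It closes each output file before switching to the next (some awk implementations limit the number of open files), skips anything before the first header, and drops blank separator lines. The workdir variable, the outdir awk variable and the shortened sample sequences are made up for this illustration.

```shell
# Hypothetical scratch directory and sample input (sequences shortened).
workdir=$(mktemp -d)
printf '%s\n' \
  '$>g34 | effector probability: 0.6' \
  'GPCKPRTSASNT' \
  '' \
  '$>g104 | effector probability: 0.65' \
  'GIFSSLICATTA' > "$workdir/input.txt"

awk -v outdir="$workdir" '
  /^\$>/ {
    if (fname != "") close(fname)      # release the previous output file
    fname = outdir "/protein_" substr($1, 3) ".faa"
  }
  fname != "" && NF { print > fname }  # ignore blank lines and text before the first header
' "$workdir/input.txt"

ls "$workdir"
```

Because of the NF test, each output file holds only the header and its sequence line, which matches the example output in the question. Remove that test if the blank lines should be kept.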

Hope that helps!

CodePudding user response:

If awk is an option:

awk '/\|/ {split($1,a,">"); fname="protein_"a[2]".faa"} {print $0 >> fname}' src.dat
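
To see how the file name is derived in that one-liner, here is a small sketch that runs only the split step on one header line from the question. The header and id shell variables are just for illustration.

```shell
# split($1, a, ">") cuts the first field "$>g34" at ">":
# a[1] is "$" and a[2] is "g34".
header='$>g34 | effector probability: 0.6'
id=$(printf '%s\n' "$header" | awk '{ split($1, a, ">"); print a[2] }')
printf 'protein_%s.faa\n' "$id"
```

Note that >> appends, so running the one-liner twice adds each record to its file a second time. Deleting old protein_*.faa files before a rerun avoids that.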

CodePudding user response:

awk is better than sed for this problem. You can still implement it in GNU sed (the -r and -z options and the e flag are GNU extensions) with

sed -rz 's/(\$>)(g[^ ]*)([^\n]*\n[^\n]*)\n/echo '\''\1\2\3'\'' > protein_\2.faa/ge' file

This solution is nice for showing some sed tricks:

  • -z for parsing fragments that span several lines
  • (..) for remembering strings
  • \$ matching a literal $
  • [^\n]* matching until end of line
  • '\'' for a single quote: end the single-quoted string, add an escaped single quote, and start a new single-quoted string
  • \2 for recalling the second remembered string
  • Write a bash command in the replacement string
  • e execute result of replacement
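
Before letting sed execute anything, it can help to look at the commands it would generate. The sketch below assumes GNU sed and uses the same substitution as the command above with the e flag dropped, so the echo commands are printed instead of run. The workdir and generated variables and the shortened sample sequences are made up for this illustration.

```shell
# Hypothetical scratch directory and sample input (sequences shortened).
workdir=$(mktemp -d)
printf '%s\n' \
  '$>g34 | effector probability: 0.6' \
  'GPCKPRTSASNT' \
  '' \
  '$>g104 | effector probability: 0.65' \
  'GIFSSLICATTA' > "$workdir/input.txt"

# Same pattern and replacement as the sed command above, minus the e flag.
# tr strips any NUL terminator that -z mode may emit.
generated=$(sed -rz 's/(\$>)(g[^ ]*)([^\n]*\n[^\n]*)\n/echo '\''\1\2\3'\'' > protein_\2.faa/g' "$workdir/input.txt" | tr -d '\0')
printf '%s\n' "$generated"
```

Each record becomes one echo '...' > protein_g##.faa command. Since the record text is pasted into a shell command, a single quote inside a header or sequence would break the quoting, which is another reason to prefer awk here.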