What do $f, "\t", "\n" mean when linearizing FASTA using awk?

Time:07-07

I am trying to linearize a FASTA file using awk, and I am totally new to it. I have this script:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  < $f | tr "\t" "\n" > ${f/.fasta/_lin.fasta}

I don't understand anything in the < $f | tr "\t" "\n" > ${f/.fasta/_lin.fasta} part. What is $f, and what are tr, \t, and \n? Where exactly am I supposed to give the input file? Can someone please elaborate?

CodePudding user response:

Where exactly I am supposed to give the input file?

You should probably read and understand the user manual for a tool before attempting to use it, unless you have already accepted the potential damage.

tr "\t" "\n"

This is the tr command with two arguments. Most Linux commands come with a manual page, which you can access like so:

man tr

Popular ones also have online versions, for example the tr manpage, where the meaning of \t and \n can be found:

\n
    new line
\t
    horizontal tab
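
As a quick check of what those two escapes do in tr, here is a minimal sketch. The sample text is made up for illustration and is not taken from the question.

```shell
# Made-up sample: three fields separated by tab characters.
# printf turns \t and \n in its format string into a real tab and newline.
out=$(printf 'header\tACGT\tTTGA\n' | tr "\t" "\n")

# Each tab has become a line break, so this prints three lines.
printf '%s\n' "$out"
```

Every tab in the input is replaced by a newline, and nothing else is changed.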

CodePudding user response:

Let's step through that code piece by piece. First, I'll add some white space to make it more legible:

awk '
  /^>/ {
    printf("%s%s\t", (N>0?"\n":""), $0);
    N++;
    next;
  }
  {
    printf("%s",$0);
  }
  END {
    printf("\n");
  }
' < $f \
  | tr "\t" "\n" \
  > ${f/.fasta/_lin.fasta}

Okay. First, $f is a shell variable holding the name of your input file, so you need to set it (for example f=myfile.fasta) before running the command. The code's author expects the name to contain .fasta, presumably at the end, like myfile.fasta. The < operator tells the shell to feed the contents of that file to awk on standard input. It is redundant in this particular case, since awk could take the file name as an argument (unless the name contains an equals sign, which awk may interpret as a variable assignment).
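
To make the role of $f concrete, here is a minimal sketch. The file name demo.fasta and its contents are hypothetical, created only for this example, and the awk program is a trivial print-everything placeholder rather than the script from the question.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# demo.fasta is a hypothetical input file made up for this example.
printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\n' > demo.fasta

f=demo.fasta    # this is where you name your input file

# Two ways of handing the same file to awk:
via_stdin=$(awk '{ print }' < "$f")   # the shell opens the file, awk reads stdin
via_arg=$(awk '{ print }' "$f")       # awk opens the file itself

printf '%s\n' "$via_stdin"
```

Quoting "$f" is a precaution for file names that contain spaces.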

AWK then matches lines that start with >. On those lines, it prints a newline (if N > 0) or else nothing, followed by the contents of the line and a tab. It then increments N and, because of next, skips the remaining rules for that line. Other lines are printed as they're seen, without a trailing newline, so the sequence lines of one record end up joined together. After reading all of the lines of $f, a final newline is printed.
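
The two stages can be traced on a tiny input. This is only a sketch: the FASTA records are made up, and it assumes the counter in the script is incremented with N++.

```shell
# Made-up two-record FASTA input.
fasta=$(printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\n')

# Stage 1: awk puts each record on one line, header and sequence separated by a tab.
stage1=$(printf '%s\n' "$fasta" |
  awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}')

# Stage 2: tr turns that tab back into a newline.
stage2=$(printf '%s\n' "$stage1" | tr "\t" "\n")

printf '%s\n' "$stage2"
```

After stage 1 each record is a single line such as ">seq1<TAB>ACGTTTGA". After stage 2 the header and the joined sequence sit on separate lines again.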

This awk code is not very legible. It could be rewritten like this:

awk '
  /^>/ && N++ {
    printf "\n";
  }
  {
    printf "%s", $0;
  }
  /^>/ {
    printf "\t";
  }
  END {
    printf "\n";
  }
'

The only tricky piece here is that N is initially zero, so when you say N++ the first time, it returns the value before incrementing (zero = false) and therefore that condition does not trigger. When you say it the second time, it returns the value before the next incrementing (one = true) and therefore that condition triggers. Anything that is not an empty string or a zero evaluates as true.
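
The post-increment behaviour can be checked in isolation. This is a small sketch with illustrative variable names (first, second) that do not appear in the original script.

```shell
result=$(awk 'BEGIN {
  N = 0
  first  = (N++ ? "true" : "false")   # N++ yields 0 (false), then N becomes 1
  second = (N++ ? "true" : "false")   # N++ yields 1 (true), then N becomes 2
  print first, second, N
}')

echo "$result"
```

This prints "false true 2": the first test sees the old value 0, the second sees 1.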

On one line, and more golfed, that could be awk -v ORS= '/^>/&&N++{printf"\n"}1;/^>/{printf"\t"}END{printf"\n"}' (1; triggers the default action, which is to print the line; the empty ORS keeps print from appending a newline).

After awk, the output is piped to tr to translate all tabs (\t) into newlines (\n). Then the > operator redirects that output to a file whose name is given by the bash parameter expansion ${f/.fasta/_lin.fasta}, which replaces the first instance of .fasta in $f with _lin.fasta, so our example input file myfile.fasta is transformed to output file myfile_lin.fasta.
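
The file name substitution can be tried on its own. A minimal sketch follows; the ${var/pattern/replacement} form needs bash (it is not POSIX sh), and the alt variable is an extra shown only for comparison.

```shell
f=myfile.fasta

out=${f/.fasta/_lin.fasta}    # bash: replace the first ".fasta" in $f
alt=${f%.fasta}_lin.fasta     # POSIX alternative: strip a trailing ".fasta", then append

echo "$out" "$alt"
```

Both give myfile_lin.fasta here. They differ only when .fasta also appears somewhere other than the end of the name.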

CodePudding user response:

I'm guessing OP is trying to do something like this, combining both the awk and tr commands:

test
123
>456
mnq
>yesthis
nothis
789

{m,g}awk '{
       printf("%.*s%s%.*s",
             (!__< (___ +=_= !__ <  NF))*_,
              RS, $ __, _*(_ !=___),RS)
} END {
       print RS }' FS='^>' ORS=

test123>456mnq
>yesthis
nothis789