I am trying to linearize a FASTA file using awk, and I am totally new to it. I have this script:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < $f | tr "\t" "\n" > ${f/.fasta/_lin.fasta}
I don't understand anything in the < $f | tr "\t" "\n" > ${f/.fasta/_lin.fasta} part. What is $f, what is tr, and what are \t and \n? Where exactly am I supposed to give the input file? Can someone please elaborate?
CodePudding user response:
Where exactly am I supposed to give the input file?
You should read and understand a tool's user manual before attempting to use it, unless you have already decided that potential damage is acceptable.
tr "\t" "\n"
This is the tr command with 2 arguments. Most Linux commands come with a manual, which you can access like so:
man tr
Popular ones also have online versions, for example the tr manpage, where the meaning of \t and \n can be found:
\n new line
\t horizontal tab
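To see those two escapes in action, here is a minimal sketch. The sample string is made up for illustration and is not taken from the question.

```shell
# tr reads standard input and replaces every tab (\t) with a newline (\n).
# The sample text ">seq1<TAB>ACGT" is an assumption made up for this demo.
result=$(printf '>seq1\tACGT\n' | tr "\t" "\n")
printf '%s\n' "$result"
```

The part before the tab and the part after it end up on separate lines.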
CodePudding user response:
Let's step through that code piece by piece. First, I'll add some white space to make it more legible:
awk '
    /^>/ {
        printf("%s%s\t", (N>0?"\n":""), $0);
        N++;
        next;
    }
    {
        printf("%s",$0);
    }
    END {
        printf("\n");
    }
' < $f \
| tr "\t" "\n" \
> ${f/.fasta/_lin.fasta}
Okay. First, $f is a shell variable that holds the name of your input file, so you have to set it before running the command. The code's author expects the name to contain .fasta, presumably at the end, like myfile.fasta. The < operator redirects that file to awk's standard input, so awk consumes its contents. In this particular case it is redundant, because awk could be given the filename as an argument instead (unless you have an equals sign in the filename, since awk may interpret that as a variable assignment).
AWK then comes in and matches lines that start with >. On those lines, it prints a newline (if N > 0) or else nothing, followed by the contents of the line and a tab. It then increments N and, because of next, skips the remaining rules for that line. All other lines are printed as they're seen, without a trailing newline, so the sequence lines of one record are joined together. After reading all of the lines of $f, a final newline is printed.
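The steps above can be tried on a small file. This is only a sketch: the temporary directory, the file name sample.fasta and the sequences are made up for the demo, and the awk program is the one from the question, run in two stages so the intermediate result is visible.

```shell
# Hypothetical sample data; the file name and sequences are made up for this demo.
dir=$(mktemp -d)
f="$dir/sample.fasta"
printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\nAATT\n' > "$f"

# Stage 1: awk alone. Each record becomes "header<TAB>sequence" on one line.
awk_out=$(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < "$f")

# Stage 2: tr turns each tab into a newline, so the header and its joined
# sequence end up on consecutive lines.
lin_out=$(printf '%s\n' "$awk_out" | tr "\t" "\n")

printf '%s\n' "$lin_out"
rm -r "$dir"
```

Setting f to the path of your own file (and quoting it as "$f") is how the input file is supplied to the original command.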
This awk code is not very legible. It could be rewritten like this:
awk '
    /^>/ && N++ {
        printf "\n";
    }
    /^>/ {
        printf "%s\t", $0;
        next;
    }
    {
        printf "%s", $0;
    }
    END {
        printf "\n";
    }
'
The only tricky piece here is that N is initially zero, so when you say N++ the first time, it returns the value before incrementing (zero = false) and therefore that condition does not trigger. When you say it the second time, it returns the value before the next increment (one = true) and therefore that condition triggers. Anything that is not an empty string or a zero evaluates as true.
On one line, and more golfed, that could be awk '/^>/&&N++{printf "\n"} /^>/{printf "%s\t",$0;next} {printf "%s",$0} END{printf "\n"}'. (The common golfing idiom 1;, which triggers the default action of printing the line, does not help here, because print appends a newline after every line and the sequence lines would not be joined.)
After awk, the output is passed to tr to translate all tabs (\t) into newlines (\n). The result is then redirected with the > operator into a file whose name comes from the shell parameter expansion ${f/.fasta/_lin.fasta}, which replaces the first instance of .fasta in $f with _lin.fasta, so our example input file myfile.fasta produces the output file myfile_lin.fasta.
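Note that the ${f/.fasta/_lin.fasta} form is not part of POSIX sh, although bash, ksh and zsh support it. As a hedged, portable sketch, the same output name can be built by stripping the suffix instead. The file name below is hypothetical.

```shell
# Hypothetical file name, used only for illustration.
f="myfile.fasta"

# ${f%.fasta} removes a trailing ".fasta" (POSIX parameter expansion),
# then "_lin.fasta" is appended. For this name the result matches what
# bash's ${f/.fasta/_lin.fasta} produces.
out="${f%.fasta}_lin.fasta"
printf '%s\n' "$out"
```

Unlike the substitution form, this variant only touches .fasta at the very end of the name.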
CodePudding user response:
I'm guessing the OP is trying to do something like this, combining both the awk and tr commands:
test
123
>456
mnq
>yesthis
nothis
789
{m,g}awk '{ printf("%.*s%s%.*s", (!__< (___ =_= !__ < NF))*_, RS, $ __, _*(_ !=___),RS) } END { print RS }' FS='^>' ORS=
test123>456mnq
>yesthis
nothis789