Hi there, I've been playing a bit with for loops in Bash to edit a FASTA file.
The file has 24 headers that start with the '>' character, as follows:
>CP068277.2
>CP068276.2
>CP068275.2
>CP068274.2
>CP068273.2
>CP068272.2
>CP068271.2
>CP068270.2
>CP068269.2
>CP068268.2
>CP068267.2
>CP068266.2
>CP068265.2
>CP068264.2
>CP068263.2
>CP068262.2
>CP068261.2
>CP068260.2
>CP068259.2
>CP068258.2
>CP068257.2
>CP068256.2
>CP068255.2
>CP086569.2
These are actually chromosomes and I need them to be in the form of >chr1, >chr2, etc.
I wrote the following for loop:
for ((c=1; c<=24; c++));
do
sed 's/>/>chr'"$c"' /' CHM13v2.0_no-mito.fna > CHM13v2.0_no-mito_trial.fna;
done
The output, however, shows only >chr24 on every header instead of counting up (see below). Does anyone have any idea why?
>chr24 CP068277.2
>chr24 CP068276.2
>chr24 CP068275.2
>chr24 CP068274.2
>chr24 CP068273.2
>chr24 CP068272.2
>chr24 CP068271.2
>chr24 CP068270.2
>chr24 CP068269.2
>chr24 CP068268.2
>chr24 CP068267.2
>chr24 CP068266.2
>chr24 CP068265.2
>chr24 CP068264.2
>chr24 CP068263.2
>chr24 CP068262.2
>chr24 CP068261.2
>chr24 CP068260.2
>chr24 CP068259.2
>chr24 CP068258.2
>chr24 CP068257.2
>chr24 CP068256.2
>chr24 CP068255.2
>chr24 CP086569.2
P.S. No worries about the accession IDs following >chr24; I have a way to remove them with sed. Nonetheless, it would be nice to have everything done in one step.
Thanks in advance!
CodePudding user response:
Your loop is overwriting the output file on each iteration. The syntax for what you're trying to do would be:
for ((c=1; c<=24; c++));
do
sed 's/>/>chr'"$c"' /' CHM13v2.0_no-mito.fna
done > CHM13v2.0_no-mito_trial.fna
But this would be orders of magnitude more efficient, and it doesn't hard-code how many header lines you hope the file contains:
awk 'sub(/>/,""){$0=">chr" (++c) " " $0} 1' CHM13v2.0_no-mito.fna > CHM13v2.0_no-mito_trial.fna
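To see the counter at work, here is a minimal sketch that runs the same awk command on a tiny made-up FASTA file. The file contents and the variable names `sample`, `renamed` and `bare` are assumptions for illustration, not data from the question. The second command is a variant that assumes the goal is a bare >chrN header with the accession dropped.

```shell
#!/usr/bin/env bash
# Hypothetical three-record FASTA; the sequence lines are made up.
sample=$(mktemp)
printf '%s\n' '>CP068277.2' 'ACGT' '>CP068276.2' 'GGCC' '>CP068275.2' 'TTAA' > "$sample"

# sub() returns 1 only on lines where it removed a '>', so c is
# incremented once per header; the trailing 1 prints every line.
renamed=$(awk 'sub(/>/,""){$0=">chr" (++c) " " $0} 1' "$sample")

# Variant (an assumption about the desired end result): replace the
# whole header so the accession is dropped in the same pass.
bare=$(awk '/^>/{$0=">chr" (++c)} 1' "$sample")

printf '%s\n\n%s\n' "$renamed" "$bare"
rm -f "$sample"
```

Because the counter lives inside awk and only advances on header lines, the file is read once and the number of headers never has to be known in advance.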
CodePudding user response:
In each iteration of the loop, you write the output to CHM13v2.0_no-mito_trial.fna with >, which overwrites the file. So the file ends up containing only the output of the last iteration.
If you want all iterations, try replacing that line with:
sed 's/>/>chr'"$c"' /' CHM13v2.0_no-mito.fna >> CHM13v2.0_no-mito_trial.fna;
If you want each value of $c applied to only one line, try restricting the sed command to that line, e.g.:
sed ${c},${c}'s/>/>'"${c}"'/'
But you will then need to avoid appending the non-matching lines to the output file on every iteration.
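One possible way to handle that, sketched here under assumptions rather than taken from the answer: look up the real line number of each header first, turn each one into a line-addressed sed command, and run sed once over the file. The function name `renumber_headers` and the sample input are hypothetical.

```shell
#!/usr/bin/env bash
# renumber_headers is a hypothetical helper name, not part of the answer.
renumber_headers() {
  local file=$1 script
  # grep -n prints "LINE:>header"; cut keeps only the line number.
  # Each header's line number becomes one addressed sed command.
  script=$(grep -n '^>' "$file" | cut -d: -f1 |
    { c=0; while read -r n; do c=$((c + 1)); printf '%ds/^>/>chr%d /\n' "$n" "$c"; done; })
  # A single sed pass applies every command, so each line is printed once.
  sed "$script" "$file"
}

# Made-up sample input for illustration.
sample=$(mktemp)
printf '%s\n' '>CP068277.2' 'ACGT' 'TTGA' '>CP068276.2' 'GGCC' > "$sample"
result=$(renumber_headers "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

This keeps the line-addressing idea but reads the file only twice (once with grep, once with sed) instead of once per header, and nothing is appended more than once.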