Speed up bash for loop which contains multiple sed commands

Time:01-26

My bash for loop looks like this:

for i in read_* ; do 
    cut -f1 $i | sponge $i
    sed -i '1 s/^/>/g' $i
    sed -i '3 s/^/>ref\n/g' $i
    sed -i '4d' $i
    sed -i '1h;2H;1,2d;4G' $i
    mv $i $i.fasta
done 

Are there any methods of speeding up this process, perhaps using GNU parallel?

CodePudding user response:

If I read it right, this should create the same result, though it would probably help a LOT if I could see what your input and expected output look like...

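awk '{ $0 = $1 }                                 # keep only the first field
     FNR==1 { hd = ">" $0; next }                # line 1 becomes the header
     FNR==2 { hd = hd "\n" $0; next }            # line 2 is held with it
     FNR==3 { print ">ref\n" $0 > (FILENAME ".fasta") }   # line 3 goes out first, under >ref
     FNR==4 { next }                             # line 4 is dropped
     FNR==5 { print hd "\n" $0 > (FILENAME ".fasta") }    # then the held lines and line 5
' read_*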

My input files:

$: cat read_x
foo x
bar x
baz x
last x
curiosity x

$: cat read_y
FOO y
BAR y
BAZ y
LAST y
CURIOSITY y

and the resulting output files:

$: cat read_x.fasta
>ref
baz
>foo
bar
curiosity

$: cat read_y.fasta
>ref
BAZ
>FOO
BAR
CURIOSITY

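This runs in one pass with no loop aside from awk's usual internals, and leaves the originals in place so you can check it first. If all is good, all that's left is to remove the originals. For that, I would use extended globbing. Enable it on its own line first: bash parses a whole line before running any of it, so a pattern on the same line as the shopt is read before extglob is on.

$: shopt -s extglob
$: rm read_!(*.fasta)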

That will clean up the original inputs but not the new outputs.
Same results, three commands, no loops.
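
If extglob is not available, or you would rather preview the deletion first, find can make the same selection. This is only a sketch of an alternative, not part of the approach above: it assumes a find that supports -maxdepth and -delete (GNU and BSD find both do), and the scratch directory and file names are made up for the demonstration.

```shell
# Sketch: preview, then delete, every file named read_* that does not end in .fasta.
# Runs in a scratch directory so nothing real is touched.
work=$(mktemp -d) && cd "$work" || exit 1
touch read_x read_y read_x.fasta read_y.fasta   # made-up sample files

# Dry run: list what would be removed.
find . -maxdepth 1 -type f -name 'read_*' ! -name '*.fasta' -print

# Same selection, this time deleting.
find . -maxdepth 1 -type f -name 'read_*' ! -name '*.fasta' -delete

ls   # only the .fasta files should remain
```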

I am, of course, making some assumptions about what you mean to do that might not be accurate. To get this format in a single sed call:

$: sed -e 's/[[:space:]].*//' -e '1{s/^/>/;h;d}' -e '2{H;s/.*/>ref/}' -e '4x' read_x
>ref
baz
>foo
bar
curiosity

but that's not the same commands you used, so maybe I'm misreading it.
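
Since I am guessing at the intent, one way to settle it is to run the original steps and the single call on scratch copies and diff the results. The sketch below assumes GNU sed (for -i and for \n in the replacement); the sample data is made up, and a temporary file stands in for sponge so that moreutils is not required.

```shell
# Sketch: compare the question's loop body with the single sed call.
work=$(mktemp -d)
printf '%s\tx\n' foo bar baz last curiosity > "$work/read_x"
cp "$work/read_x" "$work/loop_copy"

# The question's steps, applied to the scratch copy.
f=$work/loop_copy
cut -f1 "$f" > "$f.tmp" && mv "$f.tmp" "$f"   # stands in for: cut -f1 | sponge
sed -i '1 s/^/>/g' "$f"
sed -i '3 s/^/>ref\n/g' "$f"
sed -i '4d' "$f"
sed -i '1h;2H;1,2d;4G' "$f"

# The single sed call, written to a separate file.
sed -e 's/[[:space:]].*//' -e '1{s/^/>/;h;d}' -e '2{H;s/.*/>ref/}' -e '4x' \
    "$work/read_x" > "$work/single_call"

# diff prints the differing lines; "identical" appears only if there are none.
diff "$work/loop_copy" "$work/single_call" && echo "identical"
```

Tracing the loop by hand under GNU sed, I would expect the two to differ: the substitution on line 3 turns it into two lines, so the following 4d removes the original third line (baz) and keeps the fourth (last). If the diff confirms that and the loop's result is the one you want, the single call needs adjusting to drop line 3 instead of line 4.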

To use this to in-place edit multiple files at a time (instead of calling it in a loop on each file), use -si so that the line numbers apply to each file rather than the stream of records they collectively produce.

DON'T use -is, though: sed would read the s as the backup suffix for -i instead of as the -s option. Writing -i -s is fine.

$: sed -s -i -e 's/[[:space:]].*//' -e '1{s/^/>/;h;d}' -e '2{H;s/.*/>ref/}' -e '4x' read_*
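
To see what -s changes without editing anything, here is a small sketch that prints "line 1" with and without it. It assumes GNU sed; the two-line sample files and variable names are made up.

```shell
# Sketch: how the address 1 behaves across several files, with and without -s.
work=$(mktemp -d)
printf 'foo\nbar\n' > "$work/read_x"
printf 'FOO\nBAR\n' > "$work/read_y"

# Without -s the files form one stream, so only the very first line is line 1.
joined=$(sed -n '1p' "$work/read_x" "$work/read_y")

# With -s, numbering restarts for each file, so each file has its own line 1.
separate=$(sed -s -n '1p' "$work/read_x" "$work/read_y")

printf 'without -s:\n%s\nwith -s:\n%s\n' "$joined" "$separate"
```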

This still leaves you with the issue of renaming each file, but xargs makes that pretty easy in the given example.

printf "%s\n" read_* | xargs -I@ mv @ @.fasta
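
Note that xargs -I@ still starts one mv per file, and it can trip on names containing quotes or backslashes. If your real file names are less tidy than the example, a quoted shell loop costs about the same and handles them. A sketch, run in a scratch directory with made-up names:

```shell
# Sketch: rename every read_* file to read_*.fasta, skipping ones already renamed.
work=$(mktemp -d) && cd "$work" || exit 1
touch read_x "read_o'k" read_old.fasta   # made-up names, one containing a quote

for f in read_*; do
    case $f in *.fasta) continue ;; esac   # leave existing .fasta files alone
    mv -- "$f" "$f.fasta"
done

ls
```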