sed using while loop is very slow-CodePudding

I have gff file, the contents are like the following (tab separated):

# start gene 1Chr.g1
1Chr    AUGUSTUS        gene    3636    5916    0.1             .       ID=1Chr.g1
1Chr    AUGUSTUS        transcript      3636    5916    0.1             .       ID=1Chr.g1.t1;Parent=1Chr.g1
1Chr    AUGUSTUS        transcription_start_site        3636    3636    .               .       Parent=1Chr.g1.t1
1Chr    AUGUSTUS        exon    3636    3913    .               .       Parent=1Chr.g1.t1
1Chr    AUGUSTUS        start_codon     3760    3762    .               0       Parent=1Chr.g1.t1
1Chr    AUGUSTUS        intron  3914    3995    1               .       
1Chr    AUGUSTUS        CDS     3760    3913    1               0       ID=1Chr.g1.t1.cds;Parent=1Chr.g1.t1
1Chr    AUGUSTUS        stop_codon      5628    5630    .               0       Parent=1Chr.g1.t1
1Chr    AUGUSTUS        transcription_end_site  5916    5916    .               .       Parent=1Chr.g1.t1
# start gene 1Chr.g2
1Chr    AUGUSTUS        gene    5938    8761    0.17    -       .       ID=1Chr.g2
1Chr    AUGUSTUS        transcript      5938    8761    0.17    -       .       ID=1Chr.g2.t1;Parent=1Chr.g2
1Chr    AUGUSTUS        transcription_end_site  5938    5938    .       -       .       Parent=1Chr.g2.t1
1Chr    AUGUSTUS        exon    5938    6594    .       -       .       Parent=1Chr.g2.t1
1Chr    AUGUSTUS        stop_codon      6428    6430    .       -       0       Parent=1Chr.g2.t1
1Chr    AUGUSTUS        intron  6595    7156    0.8     -       .       Parent=1Chr.g2.t1
1Chr    AUGUSTUS        CDS     6428    6594    0.89    -       2       ID=1Chr.g2.t1.cds;Parent=1Chr.g2.t1
# start gene 2Chr.g1
2Chr    AUGUSTUS        gene    11612   13481   0.09    -       .       ID=2Chr.g1
2Chr    AUGUSTUS        transcript      11612   13481   0.09    -       .       ID=2Chr.g1.t1;Parent=2Chr.g1
2Chr    AUGUSTUS        transcription_end_site  11612   11612   .       -       .       Parent=2Chr.g1.t1
2Chr    AUGUSTUS        exon    11612   13481   .       -       .       Parent=2Chr.g1.t1
2Chr    AUGUSTUS        stop_codon      11864   11866   .       -       0       Parent=2Chr.g1.t1
2Chr    AUGUSTUS        CDS     11864   12940   1       -       0       ID=2Chr.g1.t1.cds;Parent=2Chr.g1.t1
2Chr    AUGUSTUS        start_codon     12938   12940   .       -       0       Parent=2Chr.g1.t1
2Chr    AUGUSTUS        transcription_start_site        13481   13481   .       -       .       Parent=2Chr.g1.t1
# start gene 2Chr.g2
2Chr    AUGUSTUS        gene    22876   31223   0.04            .       ID=2Chr.g2
2Chr    AUGUSTUS        transcript      22876   31223   0.04            .       ID=2Chr.g2.t1;Parent=2Chr.g2
2Chr    AUGUSTUS        transcription_start_site        22876   22876   .               .       Parent=2Chr.g2.t1
2Chr    AUGUSTUS        exon    22876   23456   .               .       Parent=2Chr.g2.t1
2Chr    AUGUSTUS        exon    23515   24451   .               .       Parent=2Chr.g2.t1
2Chr    AUGUSTUS        start_codon     23519   23521   .               0       Parent=2Chr.g2.t1

I want to replace the IDs of the genes which are 1Chr.g1, 1Chr.g2, 2Chr.g1, and 2Chr.g2 to just in sequence like start from g1 to end of the IDs like in this case g4.

Expected Output

# start gene g1
1Chr    AUGUSTUS        gene    3636    5916    0.1             .       ID=g1
1Chr    AUGUSTUS        transcript      3636    5916    0.1             .       ID=g1.t1;Parent=g1
1Chr    AUGUSTUS        transcription_start_site        3636    3636    .               .       Parent=g1.t1
1Chr    AUGUSTUS        exon    3636    3913    .               .       Parent=g1.t1
1Chr    AUGUSTUS        start_codon     3760    3762    .               0       Parent=g1.t1
1Chr    AUGUSTUS        intron  3914    3995    1               .
1Chr    AUGUSTUS        CDS     3760    3913    1               0       ID=g1.t1.cds;Parent=g1.t1
1Chr    AUGUSTUS        stop_codon      5628    5630    .               0       Parent=g1.t1
1Chr    AUGUSTUS        transcription_end_site  5916    5916    .               .       Parent=g1.t1
# start gene g2
1Chr    AUGUSTUS        gene    5938    8761    0.17    -       .       ID=g2
1Chr    AUGUSTUS        transcript      5938    8761    0.17    -       .       ID=g2.t1;Parent=g2
1Chr    AUGUSTUS        transcription_end_site  5938    5938    .       -       .       Parent=g2.t1
1Chr    AUGUSTUS        exon    5938    6594    .       -       .       Parent=g2.t1
1Chr    AUGUSTUS        stop_codon      6428    6430    .       -       0       Parent=g2.t1
1Chr    AUGUSTUS        intron  6595    7156    0.8     -       .       Parent=g2.t1
1Chr    AUGUSTUS        CDS     6428    6594    0.89    -       2       ID=g2.t1.cds;Parent=g2.t1
# start gene g3
2Chr    AUGUSTUS        gene    11612   13481   0.09    -       .       ID=g3
2Chr    AUGUSTUS        transcript      11612   13481   0.09    -       .       ID=g3.t1;Parent=g3
2Chr    AUGUSTUS        transcription_end_site  11612   11612   .       -       .       Parent=g3.t1
2Chr    AUGUSTUS        exon    11612   13481   .       -       .       Parent=g3.t1
2Chr    AUGUSTUS        stop_codon      11864   11866   .       -       0       Parent=g3.t1
2Chr    AUGUSTUS        CDS     11864   12940   1       -       0       ID=g3.t1.cds;Parent=g3.t1
2Chr    AUGUSTUS        start_codon     12938   12940   .       -       0       Parent=g3.t1
2Chr    AUGUSTUS        transcription_start_site        13481   13481   .       -       .       Parent=g3.t1
# start gene g4
2Chr    AUGUSTUS        gene    22876   31223   0.04            .       ID=g4
2Chr    AUGUSTUS        transcript      22876   31223   0.04            .       ID=g4.t1;Parent=g4
2Chr    AUGUSTUS        transcription_start_site        22876   22876   .               .       Parent=g4.t1
2Chr    AUGUSTUS        exon    22876   23456   .               .       Parent=g4.t1
2Chr    AUGUSTUS        exon    23515   24451   .               .       Parent=g4.t1
2Chr    AUGUSTUS        start_codon     23519   23521   .               0       Parent=g4.t1

I wrote the following bash script, but it took too long, as I tried to count its time, so for one sed it took 1 second, and if there are 28000 iterations it will take about 8 hours, which is too much time. Is there any efficient way to do this?

awk '$3 == "gene"' $1 |cut -f9 |grep -o "=.*" |sed -e 's/=//g' >LIST.txt


COUNTER=0
cat LIST.txt | while read line; do
        COUNTER=$(expr $COUNTER   1)
        echo "sed -i 's/$line/g$COUNTER/g' $1" |bash
done
rm LIST.txt

Another thing, generate a file sedTG45 which is very annoying.

CodePudding user response：

You are running sed on the same file up to 28,000 times. With a bit of preprocessing, it's not hard to get that down to once.

This is untested, but should at the very least point you in the general direction of Awk. This is a small language; you can learn it in less than an hour, and get quite good at it in a few weeks.

awk -F '\t' '$3 == "gene" {
    g=$9; sub(/^[^=]*=/, "", g); gsub(/=/, "", g);
    a[g] = "g"   n }
  { for(k in a) gsub(k, a[k]) }1' "$1"

In very brief n maintains the counter and a is an associative array of all the genes we have plucked out.

This assumes that the definition precedes the occurrences which should be replaced. If that's not a valid assumption, you'll need two iterations over the file, but the first will be read-only, and so this should still be significantly faster than your brute-force attempt.

Addendum: If you were hellbent on sticking to your (updated) code, moving the pipe to Bash (or maybe just sed) after done would improve it significantly. Here's a light refactoring. It would still be prettier in Awk.

# Use lower case for private variable
counter=0

# Note quotes around $1
# and use of pipe instead of temp file
awk '$3 == "gene"' "$1" |
cut -f9 |
grep -o "=.*" |
sed -e 's/=//g' |
# note IFS='' and read -r
while IFS='' read -r line; do
    # avoid paleolithic expr
    ((counter  ))
    echo "s/$line/g$COUNTER/g"
done |
# pipe output to sed -i
sed -i -f - "$1"

Not all seds allow -f -. There will inevitably be a temporary file while sed -i runs but it gets deleted when it's done.

Demo, with timings: https://ideone.com/GghqiW

CodePudding user response：

Generating sed commands first might help. The inputfile will be read twice, but that should not be a problem. Please check the performance.
The sed commands you want look like

s/1Chr.g1.t1/g1/g
s/1Chr.g2.t1/g2/g
s/2Chr.g1.t1/g3/g
s/2Chr.g2.t1/g4/g

When numbers grow above 9, these commands can be wrong, so change it a little:

awk '
  ($0 != last) {
    n  ;
    printf("s/([^0-9])%s([^0-9]|$)/\\1g%s\\2/g\n", $0, n)
  }
  {
    last=$0
  }' <(grep -Eo "[0-9] Chr.g[0-9] " < file)

Returns

s/([^0-9])1Chr.g1([^0-9]|$)/\1g1\2/g
s/([^0-9])1Chr.g2([^0-9]|$)/\1g2\2/g
s/([^0-9])2Chr.g1([^0-9]|$)/\1g3\2/g
s/([^0-9])2Chr.g2([^0-9]|$)/\1g4\2/g

Now use these commands in sed (replace file with something like $1 on 2 places, add -i to sed when it looks ok):

sed -r -f <(
    awk '
      ($0 != last) {
        n  ;
        printf("s/([^0-9])%s([^0-9]|$)/\\1g%s\\2/g\n", $0, n)
      }
      {
        last=$0
      }' <(grep -Eo "[0-9] Chr.g[0-9] " < file)
  ) file