I have gff
file, the contents are like the following (tab separated):
# start gene 1Chr.g1
1Chr AUGUSTUS gene 3636 5916 0.1 . ID=1Chr.g1
1Chr AUGUSTUS transcript 3636 5916 0.1 . ID=1Chr.g1.t1;Parent=1Chr.g1
1Chr AUGUSTUS transcription_start_site 3636 3636 . . Parent=1Chr.g1.t1
1Chr AUGUSTUS exon 3636 3913 . . Parent=1Chr.g1.t1
1Chr AUGUSTUS start_codon 3760 3762 . 0 Parent=1Chr.g1.t1
1Chr AUGUSTUS intron 3914 3995 1 .
1Chr AUGUSTUS CDS 3760 3913 1 0 ID=1Chr.g1.t1.cds;Parent=1Chr.g1.t1
1Chr AUGUSTUS stop_codon 5628 5630 . 0 Parent=1Chr.g1.t1
1Chr AUGUSTUS transcription_end_site 5916 5916 . . Parent=1Chr.g1.t1
# start gene 1Chr.g2
1Chr AUGUSTUS gene 5938 8761 0.17 - . ID=1Chr.g2
1Chr AUGUSTUS transcript 5938 8761 0.17 - . ID=1Chr.g2.t1;Parent=1Chr.g2
1Chr AUGUSTUS transcription_end_site 5938 5938 . - . Parent=1Chr.g2.t1
1Chr AUGUSTUS exon 5938 6594 . - . Parent=1Chr.g2.t1
1Chr AUGUSTUS stop_codon 6428 6430 . - 0 Parent=1Chr.g2.t1
1Chr AUGUSTUS intron 6595 7156 0.8 - . Parent=1Chr.g2.t1
1Chr AUGUSTUS CDS 6428 6594 0.89 - 2 ID=1Chr.g2.t1.cds;Parent=1Chr.g2.t1
# start gene 2Chr.g1
2Chr AUGUSTUS gene 11612 13481 0.09 - . ID=2Chr.g1
2Chr AUGUSTUS transcript 11612 13481 0.09 - . ID=2Chr.g1.t1;Parent=2Chr.g1
2Chr AUGUSTUS transcription_end_site 11612 11612 . - . Parent=2Chr.g1.t1
2Chr AUGUSTUS exon 11612 13481 . - . Parent=2Chr.g1.t1
2Chr AUGUSTUS stop_codon 11864 11866 . - 0 Parent=2Chr.g1.t1
2Chr AUGUSTUS CDS 11864 12940 1 - 0 ID=2Chr.g1.t1.cds;Parent=2Chr.g1.t1
2Chr AUGUSTUS start_codon 12938 12940 . - 0 Parent=2Chr.g1.t1
2Chr AUGUSTUS transcription_start_site 13481 13481 . - . Parent=2Chr.g1.t1
# start gene 2Chr.g2
2Chr AUGUSTUS gene 22876 31223 0.04 . ID=2Chr.g2
2Chr AUGUSTUS transcript 22876 31223 0.04 . ID=2Chr.g2.t1;Parent=2Chr.g2
2Chr AUGUSTUS transcription_start_site 22876 22876 . . Parent=2Chr.g2.t1
2Chr AUGUSTUS exon 22876 23456 . . Parent=2Chr.g2.t1
2Chr AUGUSTUS exon 23515 24451 . . Parent=2Chr.g2.t1
2Chr AUGUSTUS start_codon 23519 23521 . 0 Parent=2Chr.g2.t1
I want to replace the IDs of the genes which are 1Chr.g1
, 1Chr.g2
, 2Chr.g1
, and 2Chr.g2
to just in sequence like start from g1
to end of the IDs like in this case g4
.
Expected Output
# start gene g1
1Chr AUGUSTUS gene 3636 5916 0.1 . ID=g1
1Chr AUGUSTUS transcript 3636 5916 0.1 . ID=g1.t1;Parent=g1
1Chr AUGUSTUS transcription_start_site 3636 3636 . . Parent=g1.t1
1Chr AUGUSTUS exon 3636 3913 . . Parent=g1.t1
1Chr AUGUSTUS start_codon 3760 3762 . 0 Parent=g1.t1
1Chr AUGUSTUS intron 3914 3995 1 .
1Chr AUGUSTUS CDS 3760 3913 1 0 ID=g1.t1.cds;Parent=g1.t1
1Chr AUGUSTUS stop_codon 5628 5630 . 0 Parent=g1.t1
1Chr AUGUSTUS transcription_end_site 5916 5916 . . Parent=g1.t1
# start gene g2
1Chr AUGUSTUS gene 5938 8761 0.17 - . ID=g2
1Chr AUGUSTUS transcript 5938 8761 0.17 - . ID=g2.t1;Parent=g2
1Chr AUGUSTUS transcription_end_site 5938 5938 . - . Parent=g2.t1
1Chr AUGUSTUS exon 5938 6594 . - . Parent=g2.t1
1Chr AUGUSTUS stop_codon 6428 6430 . - 0 Parent=g2.t1
1Chr AUGUSTUS intron 6595 7156 0.8 - . Parent=g2.t1
1Chr AUGUSTUS CDS 6428 6594 0.89 - 2 ID=g2.t1.cds;Parent=g2.t1
# start gene g3
2Chr AUGUSTUS gene 11612 13481 0.09 - . ID=g3
2Chr AUGUSTUS transcript 11612 13481 0.09 - . ID=g3.t1;Parent=g3
2Chr AUGUSTUS transcription_end_site 11612 11612 . - . Parent=g3.t1
2Chr AUGUSTUS exon 11612 13481 . - . Parent=g3.t1
2Chr AUGUSTUS stop_codon 11864 11866 . - 0 Parent=g3.t1
2Chr AUGUSTUS CDS 11864 12940 1 - 0 ID=g3.t1.cds;Parent=g3.t1
2Chr AUGUSTUS start_codon 12938 12940 . - 0 Parent=g3.t1
2Chr AUGUSTUS transcription_start_site 13481 13481 . - . Parent=g3.t1
# start gene g4
2Chr AUGUSTUS gene 22876 31223 0.04 . ID=g4
2Chr AUGUSTUS transcript 22876 31223 0.04 . ID=g4.t1;Parent=g4
2Chr AUGUSTUS transcription_start_site 22876 22876 . . Parent=g4.t1
2Chr AUGUSTUS exon 22876 23456 . . Parent=g4.t1
2Chr AUGUSTUS exon 23515 24451 . . Parent=g4.t1
2Chr AUGUSTUS start_codon 23519 23521 . 0 Parent=g4.t1
I wrote the following bash script, but it took too long, as I tried to count its time, so for one sed
it took 1 second, and if there are 28000
iterations it will take about 8 hours, which is too much time.
Is there any efficient way to do this?
awk '$3 == "gene"' $1 |cut -f9 |grep -o "=.*" |sed -e 's/=//g' >LIST.txt
COUNTER=0
cat LIST.txt | while read line; do
COUNTER=$(expr $COUNTER 1)
echo "sed -i 's/$line/g$COUNTER/g' $1" |bash
done
rm LIST.txt
Another thing, generate a file sedTG45
which is very annoying.
CodePudding user response:
You are running sed
on the same file up to 28,000 times. With a bit of preprocessing, it's not hard to get that down to once.
This is untested, but should at the very least point you in the general direction of Awk. This is a small language; you can learn it in less than an hour, and get quite good at it in a few weeks.
awk -F '\t' '$3 == "gene" {
g=$9; sub(/^[^=]*=/, "", g); gsub(/=/, "", g);
a[g] = "g" n }
{ for(k in a) gsub(k, a[k]) }1' "$1"
In very brief n
maintains the counter and a
is an associative array of all the genes we have plucked out.
This assumes that the definition precedes the occurrences which should be replaced. If that's not a valid assumption, you'll need two iterations over the file, but the first will be read-only, and so this should still be significantly faster than your brute-force attempt.
Addendum: If you were hellbent on sticking to your (updated) code, moving the pipe to Bash (or maybe just sed
) after done
would improve it significantly. Here's a light refactoring. It would still be prettier in Awk.
# Use lower case for private variable
counter=0
# Note quotes around $1
# and use of pipe instead of temp file
awk '$3 == "gene"' "$1" |
cut -f9 |
grep -o "=.*" |
sed -e 's/=//g' |
# note IFS='' and read -r
while IFS='' read -r line; do
# avoid paleolithic expr
((counter ))
echo "s/$line/g$COUNTER/g"
done |
# pipe output to sed -i
sed -i -f - "$1"
Not all sed
s allow -f -
.
There will inevitably be a temporary file while sed -i
runs but it gets deleted when it's done.
Demo, with timings: https://ideone.com/GghqiW
CodePudding user response:
Generating sed
commands first might help. The inputfile will be read twice, but that should not be a problem. Please check the performance.
The sed
commands you want look like
s/1Chr.g1.t1/g1/g
s/1Chr.g2.t1/g2/g
s/2Chr.g1.t1/g3/g
s/2Chr.g2.t1/g4/g
When numbers grow above 9, these commands can be wrong, so change it a little:
awk '
($0 != last) {
n ;
printf("s/([^0-9])%s([^0-9]|$)/\\1g%s\\2/g\n", $0, n)
}
{
last=$0
}' <(grep -Eo "[0-9] Chr.g[0-9] " < file)
Returns
s/([^0-9])1Chr.g1([^0-9]|$)/\1g1\2/g
s/([^0-9])1Chr.g2([^0-9]|$)/\1g2\2/g
s/([^0-9])2Chr.g1([^0-9]|$)/\1g3\2/g
s/([^0-9])2Chr.g2([^0-9]|$)/\1g4\2/g
Now use these commands in sed
(replace file with something like $1
on 2 places, add -i
to sed
when it looks ok):
sed -r -f <(
awk '
($0 != last) {
n ;
printf("s/([^0-9])%s([^0-9]|$)/\\1g%s\\2/g\n", $0, n)
}
{
last=$0
}' <(grep -Eo "[0-9] Chr.g[0-9] " < file)
) file