Fastest way to find lines from a large file in another file


I am using grep in a while loop to find lines from one file in another file and saving the output to a new file. My file is quite large (226 million lines) and the script is taking forever (12 days and counting). Do you have a suggestion to speed it up? Perhaps there is a better way than grep?

(I also need the preceding line for the output, therefore grep -B 1.)

Here is my code:

#!/bin/bash

while IFS= read -r line; do
  grep -B 1 $line K33.21mercounts.bf.trimmedreads.dumps.fa >> 21mercounts.bf.trimmedreads.diff.kmers.K33;
done <21mercounts.bf.trimmedreads.diff.kmers 

Update:

The input file with the lines to look for is 4.7 GB with 226 million lines and looks like this:

AAAGAAAAAAAAAGCTAAAAT
ATCTCGACGCTCATCTCAGCA
GTTCGTCGGAGAGGAGAGAAC
GAGGACTATAAAATTGTCGCA
GGCTTCAATAATTTGTATAAC
GACATAGAATCACGAGTGACC
TGGTGAGTGACATCCTTGACA
ATGAAAACTGCCAGCAAACTC
AAAAAACTTACCTTAAAAAGT
TTAGTACACAATATCTCCCAA

The file to look in is 26 GB with 2 billion lines and looks like this:

>264638
AAAAAAAAAAAAAAAAAAAAA
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>28
TCTTTTCAGGAGTAATAACAA
>13
AATCATTTTCCGCTGGAGAGA
>38
ATTCAATAAATAATAAATTAA
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC

The expected output would be this:

>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC

CodePudding user response:

You can try this grep -f command without a shell loop, using a fixed-string search:

grep -B1 -Ff 21mercounts.bf.trimmedreads.diff.kmers \
 K33.21mercounts.bf.trimmedreads.dumps.fa > 21mercounts.bf.trimmedreads.diff.kmers.K33
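
If there is any chance of a k-mer matching only part of a line, adding -x restricts grep to whole-line matches; -x is also among the options the last answer below mentions for speeding up this kind of search. A possible variant:

grep -B1 -xFf 21mercounts.bf.trimmedreads.diff.kmers \
 K33.21mercounts.bf.trimmedreads.dumps.fa > 21mercounts.bf.trimmedreads.diff.kmers.K33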

CodePudding user response:

Here's a solution using awk. Not sure if it will be faster than grep or ripgrep, but it could be, thanks to its hash-based lookup. This assumes your RAM is big enough to hold the first file (4.7 GB and 226 million lines).

$ awk 'NR==FNR{a[$1]; next} $0 in a{print p; print} {p=$0}' f1 f2
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC
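
For reference: a[$1]; next stores every needle as an array key while the first file is read; while reading the second file, p always holds the previous line, so when a sequence line is found in the array, the preceding >count line (p) and the sequence itself are printed.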

mawk is usually the fastest option, but I have come across examples where gawk is faster, especially for array-heavy commands like this one. If you can install frawk, that can give you even faster results. The command needs to be slightly modified:

frawk 'NR==FNR{a[$1]; next} $0 in a{print p; print $0} {p=$0}' f1 f2

CodePudding user response:

If preserving the original order is not required, here is an approach using GNU sort, uniq and sed:

{ cat 21mercounts.bf.trimmedreads.diff.kmers
  sed -n 'x;n;G;s/\n//p' K33.21mercounts.bf.trimmedreads.dumps.fa
} | LC_ALL=C sort | uniq -w21 -D |
sed -n 's/\(.*\)>\(.*\)/>\2\n\1/p' > 21mercounts.bf.trimmedreads.diff.kmers.K33
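
Roughly, the sed on the .fa file joins each (>count, sequence) pair into one line of the form SEQUENCE>count; these lines are sorted together with the bare k-mer lines, uniq -w21 -D keeps only lines whose first 21 characters occur more than once (i.e. k-mers present in both files), and the final sed drops the bare k-mer lines and turns SEQUENCE>count back into the original two-line record.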

CodePudding user response:

grep can search for many patterns (given in a separate file) simultaneously, so reading K33.21mercounts.bf.trimmedreads.dumps.fa will only be done once. Something like the following might work:

#!/bin/bash

grep -f 21mercounts.bf.trimmedreads.diff.kmers -B 1 K33.21mercounts.bf.trimmedreads.dumps.fa >> 21mercounts.bf.trimmedreads.diff.kmers.K33

However, it probably requires lots of RAM.
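
Combining this with -F (fixed-string matching), as shown in the first answer above, usually reduces both memory use and run time, since the 226 million k-mers then don't have to be treated as regular expressions.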

CodePudding user response:

Any time I deal with files this big, I almost always end up sorting them. Sorts are slow, but take a lot less time than your while read loop, which scans 2 billion lines 226 million times.

sort 4GB>4gb.srt

and

sed '/>/{N;s/\n/ /}' 26GB | sort -t' ' -k2 > 26gb.srt

which will produce a file like this:

>264638 AAAAAAAAAAAAAAAAAAAAA
>1 AAAGAAAAAAAAAGCTAAAAT
>13 AATCATTTTCCGCTGGAGAGA
>1 ATCTCGACGCTCATCTCAGCA
>38 ATTCAATAAATAATAAATTAA
>2 GAGGACTATAAAATTGTCGCA
>1 GGCTTCAATAATTTGTATAAC
>1 GTTCGTCGGAGAGGAGAGAAC
>28 TCTTTTCAGGAGTAATAACAA

Now you only have to read through each file once.

$ cat tst
awk 'BEGIN{ getline key < "4gb.srt"; }
 $2 <  key { next; }
 $2 >  key { while ($2 > key) { if ((getline key < "4gb.srt") <= 0) break } }  # stop at end of the needle file
 $2 == key { $0 = gensub(/ /, "\n", 1); print }' 26gb.srt  # gensub needs GNU awk

$ ./tst
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC
>1
GTTCGTCGGAGAGGAGAGAAC
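
For reference, this works as a sorted merge: key advances through the sorted needle file while awk streams through the sorted haystack exactly once, and whenever the sequence field equals the current needle, the space is turned back into a newline and the record is printed.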

The ordering is different from yours, but otherwise does that work?

(Try some tests with smaller files first...)

CodePudding user response:

There are quite a few tools (e.g. ripgrep) and options (-f, -F, and -x) to speed up your basic approach. But all of them are basically the same slow approach as you are using now, "only" sped up by a huge but still constant factor.

For your problem and input sizes, I'd recommend changing the approach altogether. There are many different ways to tackle your problem.

First, let's define some variables to estimate the speedup of those approaches:

Problem

A 26 GB haystack file with h = 1 billion entries (description, sequence) = 2 billion lines, that look like

>28
TCTTTTCAGGAGTAATAACAA
>13
AATCATTTTCCGCTGGAGAGA
>38
ATTCAATAAATAATAAATTAA
...

A 4.7 GB needles file with n = 226 million lines, each of length m = 21.

GACATAGAATCACGAGTGACC
TGGTGAGTGACATCCTTGACA
ATGAAAACTGCCAGCAAACTC
...

For all needles, we want to extract the corresponding entries in the haystack (if they exist).

Solutions

We assume n < h and a constant m. Therefore O(n + h) = O(h), O(m) = O(1), and so on.

Naive – O(h·n) time

Currently, you are using the naive approach. For each needle, the entire haystack is searched once.

Build a lookup structure and search only once – O(… + h·…) time

Store all needles in a data structure which has a fast contains() operation. Then iterate the haystack and call needles.contains(haystackEntry) for each entry, to decide whether it is something you are searching for.

Currently, your "data structure" is a list, which takes O(1) time to "build" (because it is already in that form), but O(n) time to query once!

Better data structures exist, e.g.

  • A Trie (= a prefix tree) takes O(n) time to build and O(1) time to query once, resulting in O(n + h·1) overall time, so basically O(h) in your case.
    A Trie can be expressed as a regex, so you could stick with grep. E.g. the needles ABC, ABX, and XBC can be stored in the Trie regex ^(AB(C|X)|XBC).
  • A hash map. The time depends on the concrete implementation. On average, it should be possible to populate one in O(n) time and query it in O(1) time too. But keeping 4.7 GB of raw data in such a data structure in memory is probably not very efficient.
    This solution can be implemented easily in awk, as done by sundeep in the awk answer above.

Either way, data structures and bash don't mix very well. And even if you switched to a better language, the lookup structure would have to be re-built, or stored and loaded, each time you run the program. Therefore it is easier and nearly as efficient to ...

Sort and search only once – O(h·log(h) + h) time

You can sort the haystack and the needles, and then iterate the haystack only once.

Take the first needle and search the haystack from the beginning. When reaching a haystack entry that would have to be sorted behind the current needle, take the next needle and continue the search from your current location.

This can be done easily with standard command line tools. Here we use GNU coreutils to make processing a bit easier, faster, and safer:

export LC_ALL=C  # byte-wise collation; speeds up sorting

# needles: one k-mer per line -> one NUL-terminated record per k-mer, sorted
tr \\n \\0 < needles > needles0
sort -z -S66% -o needles0 needles0

# haystack: turn each (">count", sequence) pair into one NUL-terminated
# record of the form ">count\fsequence", then sort by the sequence field
awk 'NR%2 {desc=$0; next} {print desc "\f" $0}' ORS=\\0 haystack > haystack0
sort -zt$'\f' -k2,2 -S66% -o haystack0 haystack0

# join on the sequence field; --nocheck-order is not needed, but speeds up the process
join -zt$'\f' -22 -o 2.1,2.2 --nocheck-order needles0 haystack0 |
tr '\0\f' '\n\n'

This changes the order of the output. If you need the output in the original order, use a Schwartzian transform: Before sorting the needles/haystack, store their line numbers and drag them along through the entire process. At the end, sort the found entries by those line numbers.
