grep a very large number of independent patterns from a big file/directory, separating results


I have a flat file of ~740,000 patterns to grep against a ~30 GB directory.

I have either a ~30 GB directory to check or an already reduced ~3 GB file to work with,

and I want to do something like

analyse from(patterns) -> directory -> any line with pattern >> patternfile

so I might use something like:

awk '
    BEGIN {
        # build one big alternation regexp from all patterns in file1
        while (getline <"file1") pattern = pattern "|" $0;
        pattern = substr(pattern, 2);
    }
    # on a match, print the next 3 lines
    match($0, pattern) {for(i=1; i<=3; i++) {getline; print}}
' file2 > file3

but it gives only one big output file, not one per pattern found (each pattern would result in 7 to 15 output lines in total). Or, in bash, something like this (where VB3 is already a much smaller test file):

while read -r ; do grep -i "$REPLY" VB3.txt > "OUT/$REPLY.outputparID.out" ; done < listeID.txt

and so on

but a rapid calculation gives me an estimate of more than 5 days to get results...

How can I do the same in 2-3 hours maximum, or better? The difficulty here is that I need separate results per pattern, so the grep -F (-f) method cannot work.

CodePudding user response:

You want to scan the files once for all patterns. The approach should be: load the patterns into memory, check each input line against every pattern, and accumulate results per pattern.

something like this should work (untested script)

$ awk 'NR==FNR{pat[$0]=NR; next} 
              {for(p in pat) 
                 if($0~p) {
                    close(file); 
                    file=pat[p]".matches";
                    print > file;
                 }}' patterns.file otherfiles...

I suggest you get a small sample of patterns and small number of files and give it a try.

The filenames are the indices of the patterns used; it should be easy to look back at which pattern each one corresponds to. Since patterns may contain special characters, you may not want to use them as filenames directly.
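
If you need to map an N.matches file back to its pattern later, one way to record the mapping (a minimal sketch, assuming the same patterns.file and a made-up pattern.index output name) is to dump an index-to-pattern table up front:

$ awk '{print FNR "\t" $0}' patterns.file > pattern.index

Line N of patterns.file then corresponds to N.matches.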

Please post the timings if you can use this or a variation of it.

Another suggestion: opening/closing thousands of files may have a significant time cost. In that case, record the results in a single file, keyed by the pattern (or pattern index). Once done, you can sort the results and split them into individual files per key.

Again, untested...

$ awk 'NR==FNR{pat[$0]=NR; next} 
              {for(p in pat) 
                 if($0~p) print pat[p] "\t" $0;
              }' patterns.file otherfiles...  | sort > omni.file

and separate them

$ awk -F'\t' 'prev!=$1 {close(file); prev=$1; file=$1".matches"}
                       {print $2 > file}' omni.file

This assumes there is no tab in the results; otherwise either find an unused character as the delimiter, or set $1 to null and rebuild $0.
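
A rough, equally untested sketch of that last variant, which keeps any tabs inside the matched lines intact by blanking $1 and rebuilding $0 with a tab OFS:

$ awk -F'\t' -v OFS='\t' 'prev!=$1 {close(file); prev=$1; file=$1".matches"}
                          {$1=""; print substr($0, 2) > file}' omni.file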

CodePudding user response:

If you have a few GB of RAM available then this might be the fastest awk solution:

note: because you mentioned grep -F, I assume your patterns are literal strings rather than regexps

awk -v FS='^$' '
    BEGIN {
        while ( (getline line < ARGV[1]) > 0 )
            patterns[line]
        delete ARGV[1]
    }
    {
        for ( str in patterns )
            if ( index($0, str) )
                matches[str] = matches[str] $0 "\n"
    }
    END {
        for ( str in patterns )
            printf "%s\n", matches[str]
    }
' patterns.txt dir/*.data

remark: I'm not sure what you want to do with the results so I just output them with a blank line in between
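
If you instead want one output file per pattern, a possible (untested) variant of the END block writes each pattern's accumulated matches to its own file, using a made-up index-based name to avoid special characters in the pattern strings:

    END {
        for ( str in patterns ) {
            out = ++n ".matches"        # note: for-in iteration order is unspecified
            printf "%s", matches[str] > out
            close(out)
        }
    }

You would probably also want to print n and str to a separate map file so you know which file corresponds to which pattern.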

CodePudding user response:

Inspired by @karakfa's "Another suggestion" something like this might work well for you (untested):

grep -hf patterns directory/* |
awk '
    NR==FNR {
        pats[$0]
        next
    }
    {
        for ( pat in pats ) {
            if ( $0 ~ pat ) {
                print pat, $0
            }
        }
    }
' patterns - |
sort |
awk '
    $1 != prev {
        # new pattern key: switch output files
        close(out)
        out = $1 ".txt"
        prev = $1
    }
    {
        # strip the pattern key, keep the matched line
        sub(/[^ ]+ /,"")
        print > out
    }
'

The above assumes none of your "patterns" contain spaces and that by "pattern" you mean "partial line regexp". If you mean something else then change the regexp parts to use whatever other form of pattern matching you need.
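
For example, if your patterns are literal strings, one untested way to adapt the first two stages would be:

grep -hFf patterns directory/* |
awk '
    NR==FNR { pats[$0]; next }
    { for ( pat in pats ) if ( index($0, pat) ) print pat, $0 }
' patterns -

piped into the same sort and splitting awk as above.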

The first grep is there so that you only loop through the patterns for lines that you already know match at least one of them.
