I have a flat file of ~740,000 patterns to grep against a ~30 GB directory.
I have either a ~30 GB directory to check or an already shortened ~3 GB file to work with,
and I want to do something like
analyse from(patterns) -> directory -> any line with pattern >> patternfile
so I might use something like:
awk '
BEGIN {
    while (getline < "file1") pattern = pattern "|" $0;
    pattern = substr(pattern, 2);
}
match($0, pattern) {for (i = 1; i <= 3; i++) {getline; print}}
' file2 > file3
but it gives only one big output file, not one file per pattern found (each pattern would produce 7 to 15 lines of output in total). Or in bash something like this (where VB3 is already a much smaller test file):
while read ; do grep -i "$REPLY" VB3.txt > "OUT/$REPLY.outputparID.out" ; done < listeID.txt
and so on
but a rough calculation gives me an estimate of more than 5 days to get results...
How can I do the same in 2-3 hours maximum, or better? The difficulty here is that I need separate results per pattern, so the grep -F (-f) method cannot work.
CodePudding user response:
You would want to scan the files only once for all patterns. The approach should be: load the patterns into memory, check each line against every pattern, and accumulate results per pattern.
Something like this should work (untested script):
$ awk 'NR==FNR{pat[$0]=NR; next}
       {for(p in pat)
          if($0~p) {
            close(file);
            file=pat[p]".matches";
            print >> file;    # append: reopening with ">" after close() would truncate
          }}' patterns.file otherfiles...
I suggest you take a small sample of patterns and a small number of files and give it a try.
The filenames are the indices of the patterns used, so it should be easy to look back up which pattern each one corresponds to. Since patterns may contain special characters, you may not want to use them as filenames directly.
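If it helps, a one-liner (untested; pattern.index is just a hypothetical name) can dump an index-to-pattern map to refer back to:
$ awk '{print NR "\t" $0}' patterns.file > pattern.index
Finding out which pattern, say, 7.matches belongs to is then just awk -F'\t' '$1==7' pattern.index.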
Please post the timings if you can use this or a variation of it.
Another suggestion: opening/closing thousands of files may have a significant time cost. In that case, record the results in a single file, keyed by the pattern (or pattern index). Once done, you can sort the results and split them into individual files per key.
Again, untested...
$ awk 'NR==FNR{pat[$0]=NR; next}
       {for(p in pat)
          if($0~p) print pat[p] "\t" $0;
       }' patterns.file otherfiles... | sort > omni.file
and separate them
$ awk -F'\t' 'prev!=$1 {close(file); prev=$1; file=$1".matches"}
              {print $2 > file}' omni.file
This assumes there is no tab in the results; otherwise either find an unused character as the delimiter, or set $1 to null and rebuild $0.
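For instance, a tab-safe variant of the split step (untested sketch, same omni.file as above) could strip the key off the front of each record instead of printing $2:
$ awk -F'\t' '{key=$1; sub(/^[^\t]*\t/,"")}    # save the key, then strip it; the rest may itself contain tabs
              key!=prev {close(file); prev=key; file=key".matches"}
              {print > file}' omni.file
Because omni.file is sorted, each key's lines are contiguous, so every output file is opened and closed exactly once.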
CodePudding user response:
If you have a few GB of RAM available then this might be the fastest awk solution.
Note: because you mentioned grep -F, I assume that your patterns are literal strings instead of regexps.
awk -v FS='^$' '
BEGIN {
while ( (getline line < ARGV[1]) > 0 )
patterns[line]
delete ARGV[1]
}
{
for ( str in patterns )
if ( index($0, str) )
matches[str] = matches[str] $0 "\n"
}
END {
for ( str in patterns )
printf "%s\n", matches[str]
}
' patterns.txt dir/*.data
Remark: I'm not sure what you want to do with the results, so I just output them with a blank line in between.
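If the goal is one output file per pattern instead, a possible variation of the END block (untested; it names files by pattern index with an arbitrary .matches suffix, so that special characters in patterns don't end up in filenames) would be:
awk -v FS='^$' '
BEGIN {
    while ( (getline line < ARGV[1]) > 0 )
        patterns[line] = ++n              # remember each pattern with its index
    delete ARGV[1]
}
{
    for ( str in patterns )
        if ( index($0, str) )
            matches[str] = matches[str] $0 "\n"
}
END {
    for ( str in patterns )
        if ( str in matches ) {
            out = patterns[str] ".matches"    # e.g. 17.matches for the 17th pattern
            printf "%s", matches[str] > out
            close(out)
        }
}
' patterns.txt dir/*.data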
CodePudding user response:
Inspired by @karakfa's "Another suggestion", something like this might work well for you (untested):
grep -hf patterns directory/* |
awk '
NR==FNR {
pats[$0]
next
}
{
for ( pat in pats ) {
if ( $0 ~ pat ) {
print pat, $0
}
}
}
' patterns - |
sort |
awk '
$1 != prev {
close(out)
out = $1 ".txt"
prev = $1
}
{
sub(/[^ ]+ /,"")
print > out
}
'
The above assumes none of your "patterns" contain spaces and that by "pattern" you mean "partial line regexp". If you mean something else then change the regexp parts to use whatever other form of pattern matching you need.
The first grep is so you're only looping through the patterns for each line that you know matches at least one of them.
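If the patterns really are literal strings (as the grep -F in the question suggests), a fixed-string variant of the same pipeline (untested sketch) would prefilter with grep -F and use index() instead of a regexp comparison:
grep -hFf patterns directory/* |
awk '
NR==FNR {
    pats[$0]
    next
}
{
    for ( pat in pats ) {
        if ( index($0, pat) ) {      # literal substring test instead of regexp match
            print pat, $0
        }
    }
}
' patterns - |
sort |
awk '
$1 != prev {
    close(out)
    out = $1 ".txt"
    prev = $1
}
{
    sub(/[^ ]+ /,"")                 # drop the leading pattern key
    print > out
}
'
The same assumptions apply: no spaces in the patterns, and sorted input so each per-pattern file is written in one contiguous run.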