I am looking for suggestions on how to speed up the process described below, which involves a fuzzy regex search.
What I am trying to do
I am fuzzy searching for keywords, stored in a dictionary d (example just below; the value is always a list of two sequences, and I need to keep track of which of the two was found, if any), in a set of strings stored in a file testFile (one string per line, ~150 characters each), allowing at most 4 mismatches.
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"], "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]} #simplified to just two keywords
How I do it
For this, I first compile my regexes and store them in a dictionary compd. I then read the file line by line and search for each keyword in each line (string). I cannot stop the search once a keyword has been found, since multiple keywords may be found in one string/line, but I can skip the second element of the list associated with a keyword if the first one is found.
Here is how I am doing it:
#!/usr/bin/env python3
import argparse
import regex

parser = argparse.ArgumentParser()
parser.add_argument('file', help='file with strings')
args = parser.parse_args()

#dictionary with keywords
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"],
     "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]}

#Compile regex (4 mismatches max)
compd = {"kw1": [], "kw2": []} #to store compiled regex
for k, v in d.items(): #for each keyword
    compd[k].append(regex.compile(r'(?b)(' + v[0] + '){s<=4}')) #compile 1st elt of list
    compd[k].append(regex.compile(r'(?b)(' + v[1] + '){s<=4}')) #compile 2nd elt

#Search keywords
with open(args.file) as f: #open file with strings
    line = f.readline() #first line/string
    while line: #go through each line
        for k, v in compd.items(): #for each keyword (ID, list of compiled regex)
            for val in [v[0], v[1]]: #for each elt of list
                found = val.search(line) #regex search
                if found is not None: #if match
                    print("Keyword " + k + " found as " + found[0]) #print match
                    if val == v[0]: #if 1st elt of list
                        break #don't search 2nd
        line = f.readline() #next line
I have tested the script using the following testFile:
AGCTCGATGTATGGGTATATGATCTTGACAGAGAGA
GTCGTAGCTCGTATTCGATGGCTATTCGCTATATGCTAGCTAT
and get the following expected result:
Keyword kw1 found as AGCTCGATGTATGGGTATATGATCTTGAC
Efficiency
With the current script, it takes about 3-4 minutes to process 500k strings with six keywords. There will be cases where I have 2 million strings, which should take 12-16 minutes, and I would like to reduce this if possible.
Answer:
Having a separate regex for each keyword requires running a match against each regex separately. Instead, combine all the regexes into one using the keywords as names for named groups:
patterns = []
for k, v in d.items(): #for each keyword
    patterns.append(f'(?P<{k}>{v[0]}|{v[1]})')
pattern = '(?b)(' + '|'.join(patterns) + '){s<=4}'
reSeqs = regex.compile(pattern)
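For the example dictionary d from the question, the pattern built this way comes out as one long string (shown here only to illustrate the structure of the named groups):
(?b)((?P<kw1>AGCTCGATGTATGGGTATATGATCTTGAC|GTCAAGATCATATACCCATACATCGAGCT)|(?P<kw2>GGTCAGGTCAGTACGGTACGATCGATTTCGA|TCGAAATCGATCGTACCGTACTGACCTGACC)){s<=4}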
With this, the program can check for which named group was matched in order to get the keyword. You can replace the loop over regex matching with a loop over dictionary items (which could be implemented as a comprehension):
if matched := reSeqs.search(line):
    try:
        keyword = [kw for kw, val in matched.groupdict().items() if val][0]
        # perform further processing of keyword
    except IndexError: # no named group captured
        pass
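Put together with the original line-by-line reading, the search might look roughly like this (a sketch reusing reSeqs from above; note that a single search per line only reports the first hit on that line, whereas the finditer variant below finds all of them):
with open(args.file) as f:
    for line in f:
        matched = reSeqs.search(line)
        if matched:
            #the one named group that captured something identifies the keyword
            keyword = [kw for kw, val in matched.groupdict().items() if val][0]
            print("Keyword " + keyword + " found as " + matched[0])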
If you need further optimizations, have memory to burn, and can sacrifice a little readability, also test whether replacing the loop over lines with a comprehension and regex.Pattern.finditer is more performant:
with open(args.file) as f:
    contents = f.read() # slurp entire file

matches = [{
    val: kw for kw, val in found.groupdict().items() if val
} for found in reSeqs.finditer(contents)]
Note that the finditer solution doesn't distinguish between repetitions of a given sequence, so I didn't bother combining the dicts. You could merge entries with the same keys, or, if repetitions should be treated as a single instance, you can merge the dictionaries as-is. If you want to distinguish separate instances of a matched sequence, include file position information in the keys:
matches = [{
    (val, found.span()): kw for kw, val in found.groupdict().items() if val
} for found in reSeqs.finditer(contents)]

results = {}
for match in matches:
    results.update(match)
# or:
#results = {k: v for d in matches for k, v in d.items()}
If memory is an issue, another option would be to break up the file into chunks ending on line breaks (either line-based, or by reading blocks and separating partial lines at block ends) and use finditer on each chunk:
# implementation of `chunks` left as exercise
def file_chunks(path, size=2**12):
    with open(path) as file:
        yield from chunks(file, size=size)

results = {}
for block in file_chunks(args.file, size=2**20):
    for found in reSeqs.finditer(block):
        results.update({
            (val, found.span()): kw for kw, val in found.groupdict().items() if val
        })
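The chunks helper itself is left as an exercise above; purely for illustration, one possible sketch reads blocks of roughly size characters and always cuts on a newline, so that no line (and therefore no match) is split across two chunks:
def chunks(file, size=2**12):
    #yield blocks of about `size` characters, each ending on a line break
    leftover = ''
    while True:
        block = file.read(size)
        if not block: #end of file
            if leftover:
                yield leftover
            return
        block = leftover + block
        cut = block.rfind('\n') + 1 #position just after the last newline
        if cut == 0: #no newline in this block yet, keep accumulating
            leftover = block
            continue
        yield block[:cut]
        leftover = block[cut:] #trailing partial line goes into the next chunk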