I am looking for suggestions on how to speed up the process described below, which involves a fuzzy regex search.
What I am trying to do
I am fuzzy searching for keywords, stored in a dictionary d (example just below; the value is always a list of two sequences, and I need to keep track of which of the two was found, if any), in a set of strings stored in a file testFile (one string per line, ~150 characters each), allowing at most 4 mismatches.
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"], "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]} #simplified to just two keywords
How I do it
For this, I first compile my regexes and store them in a dictionary compd. I then read the file line by line and search for each keyword in each line (string). I cannot stop the search once a keyword has been found, since multiple keywords may be found in one string/line, but I can skip the second element of the list associated with a keyword if the first one is found.
Here is how I am doing it:
#!/usr/bin/env python3
import argparse
import regex

parser = argparse.ArgumentParser()
parser.add_argument('file', help='file with strings')
args = parser.parse_args()

#dictionary with keywords
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"],
     "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]}

#Compile regex (4 mismatches max)
compd = {"kw1": [], "kw2": []} #to store compiled regex
for k, v in d.items(): #for each keyword
    compd[k].append(regex.compile(r'(?b)(' + v[0] + '){s<=4}')) #compile 1st elt of list
    compd[k].append(regex.compile(r'(?b)(' + v[1] + '){s<=4}')) #compile 2nd elt

#Search keywords
with open(args.file) as f: #open file with strings
    line = f.readline() #first line/string
    while line: #go through each line
        for k, v in compd.items(): #for each keyword (ID, list of compiled regex)
            for val in [v[0], v[1]]: #for each elt of list
                found = val.search(line) #regex search
                if found is not None: #if match
                    print("Keyword " + k + " found as " + found[0]) #print match
                    if val == v[0]: #if 1st elt of list
                        break #don't search 2nd
        line = f.readline() #next line
I have tested the script using the following testFile:
AGCTCGATGTATGGGTATATGATCTTGACAGAGAGA
GTCGTAGCTCGTATTCGATGGCTATTCGCTATATGCTAGCTAT
and get the following expected result:
Keyword kw1 found as AGCTCGATGTATGGGTATATGATCTTGAC
Efficiency
With the current script, it takes about 3-4 minutes to process 500k strings with six keywords. There will be cases where I have 2 million strings, which should take 12-16 minutes, and I would like to reduce this if possible.
Answer:
Having a separate regex for each keyword requires running a match against each regex separately. Instead, combine all the regexes into one using the keywords as names for named groups:
patterns = []
for k, v in d.items(): #for each keyword
    patterns.append(f'(?P<{k}>{v[0]}|{v[1]})')
pattern = '(?b)(' + '|'.join(patterns) + '){s<=4}'
reSeqs = regex.compile(pattern)
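For the example dictionary d from the question, the pattern built this way comes out as one long string (shown here only to illustrate the structure of the named groups):
(?b)((?P<kw1>AGCTCGATGTATGGGTATATGATCTTGAC|GTCAAGATCATATACCCATACATCGAGCT)|(?P<kw2>GGTCAGGTCAGTACGGTACGATCGATTTCGA|TCGAAATCGATCGTACCGTACTGACCTGACC)){s<=4}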
With this, the program can check for which named group was matched in order to get the keyword. You can replace the loop over regex matching with a loop over dictionary items (which could be implemented as a comprehension):
if matched := reSeqs.search(line):
    try:
        keyword = [kw for kw, val in matched.groupdict().items() if val][0]
        # perform further processing of keyword
    except IndexError: # no named group captured
        pass
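Put together with the original line-by-line reading, the search might look roughly like this (a sketch reusing reSeqs from above; note that a single search per line only reports the first hit on that line, whereas the finditer variant below finds all of them):
with open(args.file) as f:
    for line in f:
        matched = reSeqs.search(line)
        if matched:
            #the one named group that captured something identifies the keyword
            keyword = [kw for kw, val in matched.groupdict().items() if val][0]
            print("Keyword " + keyword + " found as " + matched[0])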
If you need further optimizations, have memory to burn, and can sacrifice a little readability, also test whether replacing the loop over lines with a comprehension and regex.Pattern.finditer is more performant:
with open(args.file) as f:
    contents = f.read() # slurp entire file

matches = [{
    val: kw for kw, val in found.groupdict().items() if val
} for found in reSeqs.finditer(contents)]
Note that the finditer solution doesn't distinguish between repetitions of a given sequence, so I didn't bother combining the dicts. You could merge entries with the same keys, or, if repetitions should be treated as a single instance, you can merge the dictionaries as-is. If you want to distinguish separate instances of a matched sequence, include file position information in the keys:
matches = [{
    (val, found.span()): kw for kw, val in found.groupdict().items() if val
} for found in reSeqs.finditer(contents)]

results = {}
for match in matches:
    results.update(match)
# or:
#results = {k: v for d in matches for k, v in d.items()}
If memory is an issue, another option would be to break up the file into chunks ending on line breaks (either line-based, or by reading blocks and separating partial lines at block ends) and use finditer on each chunk:
# implementation of `chunks` left as exercise
def file_chunks(path, size=2**12):
    with open(path) as file:
        yield from chunks(file, size=size)

results = {}
for block in file_chunks(args.file, size=2**20):
    for found in reSeqs.finditer(block):
        results.update({
            (val, found.span()): kw for kw, val in found.groupdict().items() if val
        })
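The chunks helper itself is left as an exercise above; purely for illustration, one possible sketch reads blocks of roughly size characters and always cuts on a newline, so that no line (and therefore no match) is split across two chunks:
def chunks(file, size=2**12):
    #yield blocks of about `size` characters, each ending on a line break
    leftover = ''
    while True:
        block = file.read(size)
        if not block: #end of file
            if leftover:
                yield leftover
            return
        block = leftover + block
        cut = block.rfind('\n') + 1 #position just after the last newline
        if cut == 0: #no newline in this block yet, keep accumulating
            leftover = block
            continue
        yield block[:cut]
        leftover = block[cut:] #trailing partial line goes into the next chunk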