I have a list of regexes, each assigned a keyword (tag) for identification. I will be comparing a list of strings against these regexes. If any of the patterns matches, I want to identify which regex matched, in an efficient way.
regexes = {'tag1': 'regex1', 'tag2': 'regex2', 'tag3': r'^[a-z]+\.com'}
Option 1 :
for k, v in regexes.items():
    s = re.findall(v, "google.com")
    if len(s) != 0:
        print("Found match with tag : ", k)
Option 2 :
combined_regex = re.compile('|'.join('(?:{0})'.format(x) for x in regexes.values()))
print(combined_regex.findall("google.com"))
Problem :
Option 2 would identify whether any of the patterns matches. Is it also possible to know which pattern matched from the combined regex?
Answer:
If you're concerned about efficiency, compile the regexps once and for all, and don't use findall(). If you only care whether there's a match, then just use .search(): there's no need to build a list of all matches in that case.
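To make the difference concrete, here is a minimal sketch. The pattern and the sample text are made up for illustration and are not from the question.

```python
import re

# Compile once, outside any loop, so the pattern is not re-parsed on each use.
pattern = re.compile(r"[a-z]+\.com")  # illustrative pattern (an assumption)

text = "google.com and example.com"

# findall() scans the whole string and builds a list of every match.
all_matches = pattern.findall(text)

# search() stops at the first match and returns a Match object, or None.
first = pattern.search(text)

print(all_matches)
print(first.group() if first else None)
```

When all you need is a yes/no answer, the truthiness of the value returned by search() is enough.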
I'd also invert the dict, mapping compiled regexp objects to tags instead:
import re

p2tag = {re.compile('regex1'): 'tag1',
         re.compile('regex2'): 'tag2',
         re.compile(r'^[a-z]+\.com'): 'tag3'}

for s in ['aregex1', 'bregex2k', 'blah.com123', 'hopeless']:
    for p in p2tag:
        if m := p.search(s):
            print(repr(s), "matched by", repr(p2tag[p]), m)
            break
    else:
        # The inner loop ended without a break, so no pattern matched.
        print("no match for", repr(s))
which displays:
'aregex1' matched by 'tag1' <re.Match object; span=(1, 7), match='regex1'>
'bregex2k' matched by 'tag2' <re.Match object; span=(1, 7), match='regex2'>
'blah.com123' matched by 'tag3' <re.Match object; span=(0, 8), match='blah.com'>
no match for 'hopeless'
EDIT: I'll add that there is a way to find which groups matched, and that can be abused to find which of your regexps matched when squashed into a single regexp. But you need to use capturing groups for this. Here I'll add "xxxx" as a temporary prefix for your tag names to build group names, but there's no protection against conflicts with named groups with the same names in the input regexps. Continuing from the above,
pieces = []
for (p, tag) in p2tag.items():
    pieces.append(f"(?P<xxxx{tag}>{p.pattern})")
fatre = "|".join(pieces)
print(fatre)
searcher = re.compile(fatre).search

for s in ['aregex1', 'bregex2k', 'blah.com123', 'hopeless']:
    if m := searcher(s):
        assert m.lastgroup.startswith("xxxx")
        print(repr(s), "matched by", repr(m.lastgroup[4:]))
    else:
        print("no match for", repr(s))
displays:
(?P<xxxxtag1>regex1)|(?P<xxxxtag2>regex2)|(?P<xxxxtag3>^[a-z]+\.com)
'aregex1' matched by 'tag1'
'bregex2k' matched by 'tag2'
'blah.com123' matched by 'tag3'
no match for 'hopeless'
This all builds on the .lastgroup attribute of a match object, which gives the name of the last group that matched.
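A quick check of that behavior when a wrapped regexp has a capturing group of its own. This is a sketch only: the second pattern, reg(ex)2, is invented for illustration. The wrapping named group is the one that closes last, so .lastgroup should still report the wrapper's name.

```python
import re

# Hypothetical patterns: the second one contains an unnamed inner group.
combined = re.compile(r"(?P<xxxxtag1>regex1)|(?P<xxxxtag2>reg(ex)2)")

m = combined.search("bregex2k")
outer_name = m.lastgroup   # name of the last matched capturing group
inner_text = m.group(3)    # the unnamed inner group captured text too

print(outer_name, inner_text)
```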
I don't much like it. But, I haven't timed it, and if it turned out to be much faster in a context where that mattered, I'd use it ;-)
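For anyone who wants to time it, here is a rough harness using the standard timeit module. It is a sketch under assumptions: the function names tag_by_loop and tag_by_lastgroup are made up here, the sample strings are the four used above, and the numbers will depend entirely on your own patterns and inputs.

```python
import re
import timeit

patterns = {"tag1": r"regex1", "tag2": r"regex2", "tag3": r"^[a-z]+\.com"}
strings = ["aregex1", "bregex2k", "blah.com123", "hopeless"]

# Approach 1: one compiled regexp per tag, tried in order.
compiled = {re.compile(p): tag for tag, p in patterns.items()}

# Approach 2: a single alternation of named groups, using the "xxxx" prefix.
fat = re.compile("|".join(f"(?P<xxxx{tag}>{p})" for tag, p in patterns.items()))

def tag_by_loop(s):
    """Return the tag of the first pattern that matches s, or None."""
    for p, tag in compiled.items():
        if p.search(s):
            return tag
    return None

def tag_by_lastgroup(s):
    """Return the tag recovered from .lastgroup of the combined regexp, or None."""
    m = fat.search(s)
    return m.lastgroup[4:] if m else None

# Sanity check: both approaches agree on these sample strings.
assert [tag_by_loop(s) for s in strings] == [tag_by_lastgroup(s) for s in strings]

for fn in (tag_by_loop, tag_by_lastgroup):
    seconds = timeit.timeit(lambda: [fn(s) for s in strings], number=1000)
    print(f"{fn.__name__}: {seconds:.4f}s")
```

Note that the two approaches are not equivalent in general. The loop reports the first pattern in dict order that matches anywhere in the string, while the combined regexp reports whichever alternative matches at the leftmost position. For a string like "regex2 regex1" the loop gives tag1 and the combined regexp gives tag2.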