I'm working on matching a list of regular expressions against a list of strings. The problem is that the lists are very big (regexes: about 1 million, strings: about 50T). What I've got so far is this:
import re

RESULT_LIST = []
reg_list = ["domain\.com\/picture\.png", "entry{0,9}"]
y = ["test", "string", "entry4also_found", "entry5"]
for r in reg_list:
    for x in y:
        if re.findall(r, x):
            RESULT_LIST.append(x)
            print(x)
Which works very well logically, but is way too inefficient for this number of entries. Is there a better (more efficient) solution for this?
Thanks in advance.
CodePudding user response:
Use any() to test whether any of the regular expressions match; it short-circuits on the first hit rather than looping over the entire list.
Compile all the regular expressions first, so this doesn't have to be done repeatedly.
reg_list = [re.compile(rx) for rx in reg_list]
for word in y:
    if any(rx.search(word) for rx in reg_list):
        RESULT_LIST.append(word)
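Put together, a runnable version of this approach (using the sample data from the question) might look like:

```python
import re

reg_list = [r"domain\.com/picture\.png", r"entry{0,9}"]
y = ["test", "string", "entry4also_found", "entry5"]

# Compile once up front so each pattern is parsed only a single time.
compiled = [re.compile(rx) for rx in reg_list]

# any() short-circuits: a word is kept as soon as one pattern matches it.
RESULT_LIST = [word for word in y if any(rx.search(word) for rx in compiled)]
print(RESULT_LIST)  # ['entry4also_found', 'entry5']
```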
CodePudding user response:
$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop
$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop
So, if you are going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).
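The same comparison can be reproduced from inside Python with the timeit module; the absolute numbers will vary by machine, but the pre-compiled pattern should come out ahead:

```python
import timeit

# Uncompiled: re.match looks the pattern up in re's internal cache on every call.
t_uncompiled = timeit.timeit("re.match('hello', 'hello world')",
                             setup="import re", number=100_000)

# Pre-compiled: the pattern object is built once, outside the timed loop.
t_compiled = timeit.timeit("h.match('hello world')",
                           setup="import re; h = re.compile('hello')",
                           number=100_000)

print(f"uncompiled: {t_uncompiled:.3f}s, compiled: {t_compiled:.3f}s")
```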
CodePudding user response:
The only enhancements that come to mind are:
- Stopping at the first occurrence: re.findall attempts to find all matches, which is not what you are after; re.search stops at the first one.
- Pre-compiling your regexes.
reg_list = [r"domain\.com/picture\.png", r"entry{0,9}"]
reg_list = [re.compile(x) for x in reg_list]  # Step 1
y = ["test", "string", "entry4also_found", "entry5"]
RESULT_LIST = []
for r in reg_list:
    for x in y:
        if r.search(x):  # Step 2
            RESULT_LIST.append(x)
            print(x)
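With a million patterns, one further option worth mentioning is merging all patterns into a single alternation, so each string is scanned once instead of once per pattern. A minimal sketch, assuming all patterns are valid on their own (note that very large alternations can hit compile-time and memory limits in Python's re, so this may need to be done in batches):

```python
import re

reg_list = [r"domain\.com/picture\.png", r"entry{0,9}"]
y = ["test", "string", "entry4also_found", "entry5"]

# Join every pattern into one alternation; non-capturing groups keep each
# sub-pattern's syntax isolated from its neighbours.
combined = re.compile("|".join(f"(?:{rx})" for rx in reg_list))

# One compiled scan per string instead of one scan per (pattern, string) pair.
RESULT_LIST = [x for x in y if combined.search(x)]
```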