Efficient way to use regex compile (Python) with a list of 10,000 strings

I have a list of roughly 10,000 strings, and I want to find the ones that match a regex pattern. When I use re.compile, applying even a single pattern takes a long time. Is there a way to make this faster in Python?

Here is my code:


import re

list_of_strings = ["I like to eat meat", "I don't like to eat meat", "I like to eat fish", "I don't like to eat fish"]

outcome = [x for x in list_of_strings if len(re.compile(r"I like to eat (.*?)").findall(x)) != 0]

Out[6]: ['I like to eat meat', 'I like to eat fish']

Here I use just 4 strings to demonstrate the case. In reality the code should handle 10,000 strings.

I could also use multiprocessing to solve this, but maybe there is another solution using PyTorch, PySpark, or some other framework.

CodePudding user response：

re.compile is meant to be called once, not once per string. Compile the pattern outside the loop and reuse the compiled object, which is more efficient.

import re

pattern = re.compile(r"I like to eat (.*?)")
list_of_strings = ["I like to eat meat", "I don't like to eat meat", "I like to eat fish", "I don't like to eat fish"]

outcome = [x for x in list_of_strings if pattern.match(x)]

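Your example is a good illustration of when re.compile() pays off, i.e. when the same regex is applied many times. Note that pattern.match() only matches at the beginning of each string, which is enough for these strings; use pattern.search() if the phrase can appear anywhere.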
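To check how much this actually helps on your data, you can time both variants. The following is only a rough sketch: the 10,000-item list is hypothetical test data built by repeating the four example strings, and the function names are made up for illustration. Keep in mind that the re module caches recently compiled patterns internally, so the gain from compiling once may be smaller than expected; measuring is the only way to know.

```python
import re
import timeit

# Hypothetical test data: the four example strings repeated to reach 10,000 items.
base = ["I like to eat meat", "I don't like to eat meat",
        "I like to eat fish", "I don't like to eat fish"]
list_of_strings = base * 2500

PATTERN_TEXT = r"I like to eat (.*?)"
pattern = re.compile(PATTERN_TEXT)  # compiled once, outside any loop


def compile_inside_loop(strings):
    # Mirrors the code in the question: re.compile is called for every string.
    return [x for x in strings if re.compile(PATTERN_TEXT).findall(x)]


def compile_once(strings):
    # Reuses the precompiled pattern; search() looks for a match anywhere in the string.
    return [x for x in strings if pattern.search(x)]


if __name__ == "__main__":
    # Both variants must select the same strings.
    assert compile_inside_loop(list_of_strings) == compile_once(list_of_strings)
    print("compile inside loop:",
          timeit.timeit(lambda: compile_inside_loop(list_of_strings), number=5))
    print("compile once:       ",
          timeit.timeit(lambda: compile_once(list_of_strings), number=5))
```

The absolute timings depend on your machine and Python version, so treat them as a comparison rather than a benchmark.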

CodePudding user response：

You may also consider looping over the list with a plain substring test instead of a regex.

new_list = []
for item in list_of_strings:
    if 'I like to eat' in item:
        new_list.append(item)
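The same substring idea can be written as a list comprehension. This is a minimal sketch that assumes you only need to know whether the fixed phrase occurs and do not need the captured group; the names prefix, contains_phrase, and starts_with_phrase are illustrative. Use str.startswith if the phrase must be at the beginning of the string, which is what pattern.match checks.

```python
list_of_strings = ["I like to eat meat", "I don't like to eat meat",
                   "I like to eat fish", "I don't like to eat fish"]

prefix = "I like to eat"

# Phrase may appear anywhere in the string.
contains_phrase = [s for s in list_of_strings if prefix in s]

# Phrase must be at the start of the string.
starts_with_phrase = [s for s in list_of_strings if s.startswith(prefix)]

print(contains_phrase)
print(starts_with_phrase)
```

Plain string methods avoid the regex engine entirely, so they are worth trying first whenever the pattern is a fixed phrase.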