I have a list containing roughly 10,000 strings, and I want to use a regex pattern to filter it. When I use re.compile it takes a lot of time to apply even a single regex pattern. Is there any way to make this faster in Python?
Here is my code:
import re
list_of_strings = ["I like to eat meat", "I don't like to eat meat", "I like to eat fish", "I don't like to eat fish"]
outcome = [x for x in list_of_strings if len(re.compile(r"I like to eat (.*?)").findall(x)) != 0]
Out[6]: ['I like to eat meat', 'I like to eat fish']
Here I use just 4 strings to demonstrate the case. In reality the code has to handle 10,000 strings.
I could also use multiprocessing to solve this, but maybe there is another solution using PyTorch, PySpark, or other frameworks.
CodePudding user response:
re.compile
is meant to be called only once per pattern. Compile the pattern once, then reuse the compiled regex object; that is more efficient than compiling inside the list comprehension.
import re
pattern = re.compile(r"I like to eat (.*?)")
list_of_strings = ["I like to eat meat", "I don't like to eat meat", "I like to eat fish", "I don't like to eat fish"]
outcome = [x for x in list_of_strings if pattern.match(x)]
Your example is a good illustration of when re.compile()
pays off, i.e. when the same regex is used intensively.
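To see the difference, here is a minimal timing sketch (the `recompile_each_time` / `compile_once` names and the 10,000-element test list are made up for illustration). Note that `re` keeps an internal cache of compiled patterns, so the gap is smaller than you might expect, but hoisting the compile out of the loop still avoids the repeated cache lookup:

```python
import re
import timeit

# ~10,000 strings, mimicking the size described in the question
list_of_strings = ["I like to eat meat", "I don't like to eat meat"] * 5000

def recompile_each_time():
    # re.compile is evaluated once per element
    return [x for x in list_of_strings if re.compile(r"I like to eat").match(x)]

def compile_once():
    # pattern is compiled a single time, then reused
    pattern = re.compile(r"I like to eat")
    return [x for x in list_of_strings if pattern.match(x)]

print("recompile each time:", timeit.timeit(recompile_each_time, number=10))
print("compile once:       ", timeit.timeit(compile_once, number=10))
```

Both functions return the same result; only the compilation overhead differs.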
CodePudding user response:
You may also consider looping the list.
new_list = []
for item in list_of_strings:
    if 'I like to eat' in item:
        new_list.append(item)
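The loop above can be condensed into a list comprehension. Since the pattern here is a fixed prefix, a plain substring (or `str.startswith`) test avoids the regex engine entirely and is typically faster:

```python
list_of_strings = ["I like to eat meat", "I don't like to eat meat",
                   "I like to eat fish", "I don't like to eat fish"]

# substring membership test; no regex machinery involved
new_list = [item for item in list_of_strings if 'I like to eat' in item]

# or, to anchor the match at the start of the string
new_list = [item for item in list_of_strings if item.startswith('I like to eat')]

print(new_list)  # → ['I like to eat meat', 'I like to eat fish']
```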