Efficient way to use regex compile (Python) with a list of 10,000 strings

Time:11-09

I have a list of approx. 10,000 strings and I want to find the strings in it that match a regex pattern. When I use re.compile, applying even a single pattern takes a long time. Is there a way to make this faster in Python?

Here is my code:


import re

list_of_strings = ["I like to eat meat", "I don't like to eat meat", "I like to eat fish", "I don't like to eat fish"]

outcome = [x for x in list_of_strings if len(re.compile(r"I like to eat (.*?)").findall(x)) != 0]

Out[6]: ['I like to eat meat', 'I like to eat fish'] 

Here I use just 4 strings to demonstrate the case. In reality the code has to handle 10,000 strings.

I could also use multiprocessing to solve this, but maybe there is another solution with PyTorch, PySpark, or another framework.

CodePudding user response:

re.compile is meant to be called once per pattern. Compile the pattern once, outside the loop, and then reuse the compiled object, which is more efficient than compiling it again for every string.

import re

pattern = re.compile(r"I like to eat (.*?)")
list_of_strings = ["I like to eat meat", "I don't like to eat meat", "I like to eat fish", "I don't like to eat fish"]

outcome = [x for x in list_of_strings if pattern.match(x)]

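Your example is a good illustration of when re.compile() pays off, i.e. when the same regex is applied many times. Note that pattern.match only matches at the start of the string; use pattern.search if the phrase may appear anywhere in the string, as the findall check in the question allows.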
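A rough sketch of the compile-once approach at the scale mentioned in the question. The data is synthetic (the four sample strings repeated 2,500 times, which is an assumption, not the asker's real list), and `filter_matching` is a hypothetical helper name used only for illustration.

```python
import re

# Compile the pattern once, outside the loop.
PATTERN = re.compile(r"I like to eat (.*?)")


def filter_matching(strings):
    """Keep the strings in which PATTERN occurs anywhere.

    search() is used instead of match() so the result agrees with the
    question's `len(findall(x)) != 0` check, which is not anchored
    to the start of the string.
    """
    return [s for s in strings if PATTERN.search(s)]


# Synthetic stand-in for the real list of 10,000 strings (assumption).
base = [
    "I like to eat meat",
    "I don't like to eat meat",
    "I like to eat fish",
    "I don't like to eat fish",
]
list_of_strings = base * 2500

outcome = filter_matching(list_of_strings)
print(len(list_of_strings))  # 10000
print(len(outcome))          # 5000
```

To see whether this actually helps on your own data, you could time both versions with the standard `timeit` module before reaching for multiprocessing.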

CodePudding user response:

You could also skip the regex entirely and loop over the list with a plain substring test.

new_list = []
for item in list_of_strings:
    if 'I like to eat' in item:
        new_list.append(item)
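The same loop can be written as a list comprehension. This is only a sketch: the variable names and the extra sample string "Today I like to eat rice" are made up here to show that `in` finds the phrase anywhere in the string, while `str.startswith` only accepts it at the beginning.

```python
list_of_strings = [
    "I like to eat meat",
    "I don't like to eat meat",
    "I like to eat fish",
    "I don't like to eat fish",
    "Today I like to eat rice",  # hypothetical extra sample
]

needle = "I like to eat"

# Substring test: the phrase may appear anywhere in the string.
contains = [s for s in list_of_strings if needle in s]

# Prefix test: the phrase must be at the start of the string.
starts = [s for s in list_of_strings if s.startswith(needle)]

print(contains)  # ['I like to eat meat', 'I like to eat fish', 'Today I like to eat rice']
print(starts)    # ['I like to eat meat', 'I like to eat fish']
```

Which of the two you want depends on whether the phrase is allowed to appear in the middle of a string; for a fixed phrase like this one, neither needs a regex.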