i have a list of regex like :
regex_list = [". rive. ",". ll","[0-9] blue car. "......] ## list of length 3000
what is the best method to match all this regex to my text
for example :
text : Hello, Owning 2 blue cars for a single driver
so in the output , i want to have a list of matched words :
matched_words = ["Hello","4 blue cars","driver"] ##Hello <==>. llo
CodePudding user response:
Alright, first of all, you will probably want to adjust your regex_list
, because of now, matching those strings will give you the entire text back as match. This is because of .
, which states that there may follow any character any amount of time. What I have done here is the following:
import re
regex_list = [".rive.",". ll.","[0-9] blue car."]
text = "Hello, Owning 2 blue cars for a single driver"
# Returns all the spans of matched regex items in text
spans = [re.search(regex_item,text).span() for regex_item in regex_list]
# Sorts the spans on first occurence (so, first element in item for every item in span).
spans.sort()
# Retrieves the text via index of spans in text.
matching_texts = [text[x[0]:x[1]] for x in spans]
print(matching_texts)
I adjusted your regex_list
slightly, so it does not match the entire text. Then, I retrieve all spans from the matches with the text. Additionally, I sort the spans on first occurence. Lastly, I retrieve the texts via the indexes of the spans and print those out. What you will get is the following
['Hello', '2 blue cars', 'driver']
NOTE: I am unsure why you would like to match '4 blue cars', because that is not in your text.
CodePudding user response:
You could also try this which is multi threaded version of @Lexpj answer
from concurrent.futures import ThreadPoolExecutor, as_completed
import re
# list of length 3000
regex_list = [".rive.", ". ll.", "[0-9] blue car."]
my_string = "Hello, Owning 2 blue cars for a single driver "
def test(text, regex):
# Returns all the spans of matched regex items in text
spans = [re.search(regex, text).span()]
# Sorts the spans on first occurence (so, first element in item for every item in span).
spans.sort()
# Retrieves the text via index of spans in text.
matching_texts = [text[x[0]:x[1]] for x in spans]
return matching_texts
with ThreadPoolExecutor(max_workers=10) as executor:
futures = {executor.submit(test, my_string, regex)
for regex in regex_list}
# as_completed() gives you the threads once finished
matched = set()
for f in as_completed(futures):
# Get the results
rs = f.result()
matched = matched.union(set(rs))
print(matched)