What is the fastest way to match words in text?

I have a list of regexes like:

regex_list = [".*rive.*", ".*ll", "[0-9]+ blue car.*", ......]  ## list of length 3000

What is the best method to match all of these regexes against my text?

For example:

text: Hello, Owning 2 blue cars for a single driver

So in the output, I want to have a list of matched words:

matched_words = ["Hello","4 blue cars","driver"]  ## "Hello" <==> ".*llo"

CodePudding user response：

Alright, first of all, you will probably want to adjust your regex_list, because as of now, matching those patterns will give you the entire text back as the match. This is because of .*, which says that any character may follow any number of times. What I have done here is the following:

import re

regex_list = [".rive.", ".*ll.", "[0-9]+ blue car."]
text = "Hello, Owning 2 blue cars for a single driver"

# Collects the span of the first match of each regex; patterns that do not match are skipped.
matches = (re.search(regex_item, text) for regex_item in regex_list)
spans = [m.span() for m in matches if m]

# Sorts the spans by first occurrence (tuples compare on their first element, the start index).
spans.sort()

# Retrieves the text via index of spans in text.
matching_texts = [text[x[0]:x[1]] for x in spans]

print(matching_texts)

I adjusted your regex_list slightly, so it does not match the entire text. Then I retrieve the spans of the matches in the text, sort them by first occurrence, and finally slice the text with the span indexes and print the results. What you will get is the following:

['Hello', '2 blue cars', 'driver']

NOTE: I am unsure why you would like to match '4 blue cars', because that is not in your text.
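
To see why patterns wrapped in .* return the whole text, here is a minimal sketch comparing such a pattern with the trimmed version. The pattern strings are assumptions based on the question, not the exact 3000-item list.

```python
import re

text = "Hello, Owning 2 blue cars for a single driver"

# Greedy ".*" on both sides consumes everything around "rive",
# so the match covers the entire text.
greedy = re.search(r".*rive.*", text)

# A single "." on each side only takes one neighbouring character.
trimmed = re.search(r".rive.", text)

print(greedy.group())   # the whole text
print(trimmed.group())  # driver
```

Trimming the surrounding .* keeps each match close to the word you are actually after.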

CodePudding user response：

You could also try this, which is a multi-threaded version of @Lexpj's answer:

from concurrent.futures import ThreadPoolExecutor, as_completed
import re


# list of length 3000
regex_list = [".rive.", ".*ll.", "[0-9]+ blue car."]
my_string = "Hello, Owning 2 blue cars for a single driver "


def test(text, regex):
    # Returns the span of the first match of the regex in text (empty list if there is no match)
    match = re.search(regex, text)
    spans = [match.span()] if match else []

    # Sorts the spans by first occurrence (start index).
    spans.sort()

    # Retrieves the text via index of spans in text.
    matching_texts = [text[x[0]:x[1]] for x in spans]
    return matching_texts


with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(test, my_string, regex)
               for regex in regex_list}

    # as_completed() yields each future as soon as it finishes
    matched = set()
    for f in as_completed(futures):
        # Get the results
        rs = f.result()
        matched = matched.union(set(rs))
    print(matched)
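
As a possible single-threaded alternative, here is a rough sketch that compiles every pattern once and collects all occurrences with finditer, ordered by position. The short regex_list stands in for the real 3000 patterns, and find_all is a hypothetical helper name, not something from the answers above.

```python
import re

# Hypothetical short list standing in for the 3000 patterns.
regex_list = [r".rive.", r".*ll.", r"[0-9]+ blue car."]
text = "Hello, Owning 2 blue cars for a single driver"

# Compile once so repeated searches do not re-parse the pattern strings.
compiled = [re.compile(p) for p in regex_list]


def find_all(text, patterns):
    """Return every match of every pattern, ordered by position in text."""
    hits = []
    for pattern in patterns:
        # finditer yields all non-overlapping matches, or nothing if none exist.
        for m in pattern.finditer(text):
            hits.append((m.start(), m.group()))
    hits.sort()
    return [word for _, word in hits]


matched_words = find_all(text, compiled)
print(matched_words)
```

Precompiling matters most when the same patterns are applied to many texts. In CPython, threads are unlikely to speed up pure regex matching much because of the global interpreter lock, so it is worth measuring both variants on your own data.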