What is the fastest way to match words in a text?

Time:12-12

I have a list of regexes like this:

regex_list = [".*rive.*", ".*ll", "[0-9]* blue car.*", ...]  # list of length 3000

What is the best method to match all of these regexes against my text?

For example:

text: Hello, Owning 2 blue cars for a single driver

In the output, I want a list of the matched words:

matched_words = ["Hello", "4 blue cars", "driver"]  # Hello <==> .*llo

CodePudding user response:

Alright, first of all, you will probably want to adjust your regex_list because, as it stands, matching those patterns gives you the entire text back as the match. This is because of .*, which matches any character any number of times. What I have done here is the following:

import re

regex_list = [".rive.", ".*ll.", "[0-9]* blue car."]
text = "Hello, Owning 2 blue cars for a single driver"

# Collects the span of the first match of every regex; regexes without a match are skipped
matches = (re.search(regex_item, text) for regex_item in regex_list)
spans = [m.span() for m in matches if m is not None]

# Sorts the spans by first occurrence (tuples compare on their first element first).
spans.sort()

# Retrieves the matched text by slicing the text with each span.
matching_texts = [text[start:end] for start, end in spans]

print(matching_texts)

I adjusted your regex_list slightly so that it does not match the entire text. Then I retrieve the spans of the matches in the text and sort them by first occurrence. Lastly, I slice the text with those spans and print the results. What you will get is the following:

['Hello', '2 blue cars', 'driver']
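To see why the patterns had to be narrowed, here is a small sketch comparing a pattern with `.*` on both sides against the adjusted one. The third pattern, `\w*rive\w*`, is not from the answer; it is only an assumed alternative that stays inside a single word.

```python
import re

text = "Hello, Owning 2 blue cars for a single driver"

# Greedy ".*" on both sides swallows everything around "rive".
greedy = re.search(r".*rive.*", text).group()

# A single "." on each side allows exactly one extra character per side.
narrow = re.search(r".rive.", text).group()

# Assumed alternative: "\w*" only extends over word characters.
word = re.search(r"\w*rive\w*", text).group()

print(greedy == text)  # True: the whole sentence is the match
print(narrow)          # driver
print(word)            # driver
```

The single-dot version only works here because "driver" happens to have exactly one character on each side of "rive"; for words of unknown length, something like `\w*` is probably closer to what the question asks for.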

NOTE: I am unsure why you would like to match '4 blue cars', because that is not in your text.
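Since the question asks about speed with about 3000 patterns, one option worth trying is to compile every pattern once and reuse the compiled objects for each text. The following is only a sketch under that assumption: the helper name `find_all_matches` is hypothetical, and it uses `finditer` so that every occurrence is reported, not just the first one per pattern.

```python
import re

# Adjusted patterns from the answer above; the real list has about 3000 entries.
regex_list = [".rive.", ".*ll.", "[0-9]* blue car."]

# Compile once so the patterns are not re-parsed for every text.
compiled_patterns = [re.compile(p) for p in regex_list]


def find_all_matches(text, patterns=compiled_patterns):
    """Hypothetical helper: every non-empty match of every pattern, in text order."""
    hits = []
    for pattern in patterns:
        for m in pattern.finditer(text):
            if m.group():  # skip empty matches
                hits.append((m.start(), m.end(), m.group()))
    hits.sort()  # order by start position, then end position
    return [matched for _, _, matched in hits]


result = find_all_matches("Hello, Owning 2 blue cars for a single driver")
print(result)  # ['Hello', '2 blue cars', 'driver']
```

Whether this is actually faster depends on how many texts are processed, so it is worth timing against the plain `re.search` loop on real data.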

CodePudding user response:

You could also try this, which is a multi-threaded version of @Lexpj's answer:

from concurrent.futures import ThreadPoolExecutor, as_completed
import re


# list of length 3000
regex_list = [".rive.", ".*ll.", "[0-9]* blue car."]
my_string = "Hello, Owning 2 blue cars for a single driver "


def test(text, regex):
    # Returns the text of the first match, or an empty list if the regex does not match
    match = re.search(regex, text)
    if match is None:
        return []
    return [match.group()]


with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(test, my_string, regex)
               for regex in regex_list}

    # as_completed() yields each future as soon as it has finished
    matched = set()
    for f in as_completed(futures):
        # Collect the results; a set drops duplicates but does not keep text order
        matched.update(f.result())
    print(matched)
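The threaded version above collects results in a set, so the order in which the words appear in the text is lost. A possible variant, sketched here with a hypothetical helper `first_span`, returns spans and sorts them afterwards. Keep in mind that in CPython the `re` module holds the global interpreter lock while matching, so threads may give little or no speedup for this kind of work; measure before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor
import re

regex_list = [".rive.", ".*ll.", "[0-9]* blue car."]
my_string = "Hello, Owning 2 blue cars for a single driver "


def first_span(regex, text=my_string):
    # Hypothetical helper: span of the first match, or None if there is no match
    m = re.search(regex, text)
    return m.span() if m else None


with ThreadPoolExecutor(max_workers=10) as executor:
    # map() yields results in the same order as regex_list
    spans = [s for s in executor.map(first_span, regex_list) if s is not None]

# Sort by start position so the matches follow their order in the text
ordered = [my_string[start:end] for start, end in sorted(spans)]
print(ordered)  # ['Hello', '2 blue cars', 'driver']
```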