How to check if words from a list are in a string and near each other? [Python]

Time:10-19

I have a string and a list of words:

string ="""Ventilation box with reference VE03 with soundproofed box with inspection door
with the following technical characteristics: Air flow: 250 l s Available static pressure:
200 Pa With voltage regulator With characteristics according to project technical data
sheets Model make: CVB 4 180 180N or equivalent Includes flexible anti-vibration tarpaulins
at the air connections and metal dampers and supports Includes cable."""

list1 = ["CVB","1100","250"]
list2 = ["CVB","4","180","180N","RE","147W"]

I want to check whether the string contains two or more words from the list and whether they are near each other (for example, within 5 positions/words before or after). With 'list1' the result has to be False, because 'CVB' and '250' aren't close together, but with 'list2' it should be True ("CVB", "4", "180" and "180N" are next to each other).

My current function only detects whether each word occurs in the string:

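import re

count = 0
for word in list1:
    # Match the word only when it is delimited by whitespace or the string boundaries
    if len(re.findall(r"(?<!\S)" + word + r"(?!\S)", string)) > 0:
        count += 1
print(count)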

CodePudding user response:

My suggestion:

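from itertools import combinations


def my_function(my_string, list1):
    list_to_search_in = my_string.split()
    # Position of the first occurrence of each distinct word found in the text
    indices = []
    for word in set(list1):
        if word in list_to_search_in:
            indices.append(list_to_search_in.index(word))
    # Compare pairs of different words only; comparing an index with itself
    # always gives a distance of 0 and would return True for any single match
    for index1, index2 in combinations(indices, 2):
        if abs(index1 - index2) <= 5:
            return True
    return False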

Also check your naming conventions and avoid things like string or list as variable names.
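
To illustrate the naming advice, here is a minimal sketch of what can go wrong when a variable is called list. The function name shadowing_demo is made up for this example; it only shows that, once the name is rebound, the built-in type is no longer reachable under that name in the same scope.

```python
def shadowing_demo() -> str:
    # Rebinding `list` hides the built-in type inside this function.
    list = ["CVB", "4", "180"]
    try:
        # This now tries to call the list object above, not the built-in.
        list("abc")
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "no error"


print(shadowing_demo())
```

A name such as string is less risky, since it only hides the standard-library string module if that module is imported under the same name, but a descriptive name like text or words avoids the question altogether.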

CodePudding user response:

You could use collections.defaultdict to record every position of each word, then compare the positions of each pair of different words:

from collections import defaultdict
from itertools import combinations


def any_close_words(text: str, words: list[str], dist: int) -> bool:
    words = set(words)
    word_to_positions = defaultdict(list)
    for index, word in enumerate(text.split()):  # Could consider using a nltk tokenizer...
        if word in words:
            word_to_positions[word].append(index)
    for w1, w2 in combinations(word_to_positions, r=2):
        for a in word_to_positions[w1]:
            for b in word_to_positions[w2]:
                if abs(a - b) <= dist:
                    return True
    return False


def main() -> None:
    string = """Ventilation box with reference VE03 with soundproofed box with inspection door
with the following technical characteristics: Air flow: 250 l s Available static pressure:
200 Pa With voltage regulator With characteristics according to project technical data
sheets Model make: CVB 4 180 180N or equivalent Includes flexible anti-vibration tarpaulins
at the air connections and metal dampers and supports Includes cable."""
    print(f'{any_close_words(string, ["CVB", "1100", "250"], 5) = }')
    print(f'{any_close_words(string, ["CVB", "4", "180", "180N", "RE", "147W"], 5) = }')


if __name__ == '__main__':
    main()

Output:

any_close_words(string, ["CVB", "1100", "250"], 5) = False
any_close_words(string, ["CVB", "4", "180", "180N", "RE", "147W"], 5) = True
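
The comment in the code above mentions that a tokenizer could replace text.split(). With a plain split, tokens such as "make:" or "cable." keep their punctuation and never match "make" or "cable". As a rough sketch of one alternative, assuming a simple regular expression is good enough for this kind of text, the variant below tokenizes with re.findall instead. The names tokenize and any_close_words_tokenized and the regex are assumptions made for this example, not part of the answer.

```python
import re
from itertools import combinations


def tokenize(text: str) -> list[str]:
    # Runs of word characters, optionally joined by hyphens ("anti-vibration");
    # surrounding punctuation such as ":" or "." is dropped.
    return re.findall(r"\w+(?:-\w+)*", text)


def any_close_words_tokenized(text: str, words: list[str], dist: int) -> bool:
    wanted = set(words)
    positions: dict[str, list[int]] = {}
    for index, token in enumerate(tokenize(text)):
        if token in wanted:
            positions.setdefault(token, []).append(index)
    # True if any two different words occur at most `dist` tokens apart.
    return any(
        abs(a - b) <= dist
        for w1, w2 in combinations(positions, 2)
        for a in positions[w1]
        for b in positions[w2]
    )


sample = "Model make: CVB 4 180 180N or equivalent Includes cable."
print(any_close_words_tokenized(sample, ["make", "CVB"], 5))
print(any_close_words_tokenized(sample, ["make", "cable"], 5))
```

In the sample, "make" and "CVB" are adjacent once the colon is stripped, while "make" and "cable" are eight tokens apart. Text with decimals or units attached to numbers would likely need a more careful pattern or a real tokenizer.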