How to find an exact match and the 5 words before and after it in Python

I have the text below in a dataframe in Python:

Text = provide written informed consent healthy male or female age between 31 to 59 years fluent in german language

I need to look for the word age and extract the 5 words before and after it. The target value is age. My desired output:

result = healthy male or female age between 31 to 59 years

my code:

 import re

 Text = "provide written informed consent healthy male or female age between 31 to 59 years fluent in german language"
 r1 = re.search(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,3} age (?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,3}", Text)
 r1.group()

my result is

 age 16 years old

My data also contains words like manage or agent, which should be ignored.

thanks

CodePudding user response：

One way to do so, without using regex, might be to split the text into words and retrieve the position of age in the word list.

Text = "provide written informed consent healthy male or female age between 31 to 59 years fluent in german language"
words = Text.split()

# list.index compares whole words, so "manage" or "agent" are not hits
idx = words.index("age")
result = words[idx - 4:idx + 6]  # 4 words before, "age" itself, 5 words after
print(result)  # ['healthy', 'male', 'or', 'female', 'age', 'between', '31', 'to', '59', 'years']
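
One caveat with the slice above: if age sits within the first four words, the start index becomes negative and the slice counts from the end of the list, and index raises a ValueError when the word is missing. A minimal sketch that guards against both cases might look like this (the helper name window and its defaults are my own, not part of the original answer):

```python
def window(words, target, before=4, after=5):
    """Hypothetical helper: slice around the first exact occurrence of target."""
    if target not in words:
        return []  # nothing to extract when the word is absent
    idx = words.index(target)
    start = max(idx - before, 0)  # clamp so the slice never wraps around
    return words[start:idx + after + 1]


words = "age between 31 to 59 years fluent in german language".split()
print(window(words, "age"))     # ['age', 'between', '31', 'to', '59', 'years']
print(window(words, "manage"))  # []
```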

CodePudding user response：

Another solution, which finds every occurrence when a given token (e.g. "age") appears more than once. It also keeps a match even when fewer than the requested number of tokens precede or follow it because the string is too short. Your question does not say what should happen in that case, and since the other answers exclude it, I thought I'd provide a solution that handles it in case that's required.

# one age
text1 = "provide written informed consent healthy male or female age between 31 to 59 years fluent in german language"
# two age
text2 = "provide written informed consent another age appeared in healthy male or female age between 31 to 59 years fluent in german language"
# age at beginning
text3 = "age at beginning provide written informed consent healthy male or female age between 31 to 59 years fluent in german language"
# age at end
text4 = "provide written informed consent healthy male or female age between 31 to 59 years fluent in german language age at the end"


def find_token(text, sep, search_token, l_from_token, r_from_token):
    """
    Find all occurrences of a token in a string with specified amount of tokens before and after token.

    :param text: text to search for given token
    :param sep: separator
    :param search_token: token to search for
    :param l_from_token: how many tokens left from token to include in result
    :param r_from_token: how many tokens right from token to include in result
    :return: all occurrences of token in text with specified amount of tokens before and after token
    """
    tokens = text.split(sep)
    matches = list()
    for i, token in enumerate(tokens):
        if token != search_token:
            continue
        start = max(i - l_from_token, 0)
        end = min(i + r_from_token + 1, len(tokens))
        matches.append(sep.join(tokens[start:end]))
    return matches


search_for = "age"
separator = " "
left_from_token = 4
right_from_token = 5

print(find_token(text1, separator, search_for, left_from_token, right_from_token))
print(find_token(text2, separator, search_for, left_from_token, right_from_token))
print(find_token(text3, separator, search_for, left_from_token, right_from_token))
print(find_token(text4, separator, search_for, left_from_token, right_from_token))

Expected output:

['healthy male or female age between 31 to 59 years']
['written informed consent another age appeared in healthy male or', 'healthy male or female age between 31 to 59 years']
['age at beginning provide written informed', 'healthy male or female age between 31 to 59 years']
['healthy male or female age between 31 to 59 years', 'fluent in german language age at the end']
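
Since the question started from a regex, here is a rough regex sketch of the same idea, assuming words are separated by whitespace. The \b word boundaries keep manage and agent from matching. The helper name regex_window and its defaults are hypothetical. Note that finditer returns non-overlapping matches, so when two occurrences are close together the second one may get a shorter left context than the token-based function above would give it.

```python
import re


def regex_window(text, word, before=4, after=5):
    """Hypothetical helper: each whole-word match of `word` with up to
    `before` words before it and up to `after` words after it."""
    pattern = re.compile(
        r"(?:\S+\s+){0,%d}\b%s\b(?:\s+\S+){0,%d}" % (before, re.escape(word), after)
    )
    return [m.group() for m in pattern.finditer(text)]


text = "provide written informed consent healthy male or female age between 31 to 59 years fluent in german language"
print(regex_window(text, "age"))
# ['healthy male or female age between 31 to 59 years']

# "manage" and "agent" contain "age" but are not whole-word matches
print(regex_window("we manage every agent", "age"))
# []
```

Unlike splitting on spaces, \b also matches age when punctuation follows it (e.g. "age,"), which may or may not be what you want.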