I currently have a function that yields a term and the sentence it occurs in. At this point, the function is only retrieving the first match from the list of terms. I would like to be able to retrieve all matches instead of just the first.
For example, given the terms list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
and the sentences text_list = ["A heart attack is a result of cardiovascular...", "Chronic intermittent hypoxia is the..."]
The ideal output is:
['heart attack', 'a heart attack is a result of cardiovascular...'],
['cardiovascular', 'a heart attack is a result of cardiovascular...'],
['hypoxia', 'chronic intermittent hypoxia is the...']
# this is the current function
def find_word(list_of_matches, line):
    for words in list_of_matches:
        if any([words in line]):
            return words, line
# returns list of 'term, matched string'
key_vals = [list(find_word(list_of_matches, line.lower())) for line in text_list if
            find_word(list_of_matches, line.lower()) != None]
# output is currently
['heart attack', 'a heart attack is a result of cardiovascular...'],
['hypoxia', 'chronic intermittent hypoxia is the...']
CodePudding user response:
You're going to want to use regex here.
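import re

def find_all_matches(words_to_search, text):
    matches = []
    for word in words_to_search:
        # re.search returns None when the word is absent, so check before calling .group();
        # re.escape makes the word match as literal text rather than as a pattern
        match = re.search(re.escape(word), text)
        if match:
            matches.append(match.group())
    return [matches, text]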
Please note that this returns a nested list: the list of all matched terms, followed by the text that was searched.
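As a minimal sketch of how the regex approach could produce the [term, sentence] pairs from the question: the function name find_term_sentence_pairs is hypothetical, and the sketch assumes the terms are already lowercase and should match only as whole words (hence the \b anchors), which the question does not state.

```python
import re

def find_term_sentence_pairs(terms, sentences):
    """Return one [term, sentence] pair for every term found in every sentence."""
    pairs = []
    for sentence in sentences:
        lowered = sentence.lower()
        for term in terms:
            # re.escape treats the term as literal text; \b limits matches to whole words
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, lowered):
                pairs.append([term, lowered])
    return pairs

list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
text_list = ["A heart attack is a result of cardiovascular...",
             "Chronic intermittent hypoxia is the..."]

for pair in find_term_sentence_pairs(list_of_matches, text_list):
    print(pair)
```

If substring matches are acceptable (for example "cardio" inside "cardiovascular"), dropping the \b anchors, or using a plain `in` test, may be enough.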
CodePudding user response:
The solution needs 2 steps:
- fix the function
- process the output
Given that your desired output follows the pattern
output = [ [word1, sentence1], [word2, sentence1], [word3, sentence2], ]
- Fix the function: move the return out of the 'for' loop so that the loop iterates over every word in list_of_matches and collects all the words that match, not only the first. It should look like this:
def find_word(list_of_matches, line):
    answer = []
    for words in list_of_matches:
        if any([words in line]):
            answer.append([words, line])
    return answer
With the function above, the output will be:
key_vals = [
    [
        ['heart attack', 'a heart attack is a result of cardiovascular...'],
        ['cardiovascular', 'a heart attack is a result of cardiovascular...']
    ],
    [
        ['hypoxia', 'chronic intermittent hypoxia is the...']
    ]
]
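The comprehension that builds key_vals is not shown for the revised function, so here is one possible sketch. Note that the revised function returns an empty list rather than None when nothing matches, so the `!= None` filter from the question would no longer exclude anything; the sketch filters on truthiness instead. The intermediate name per_sentence is an assumption made for illustration.

```python
def find_word(list_of_matches, line):
    answer = []
    for words in list_of_matches:
        if words in line:
            answer.append([words, line])
    return answer

list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
text_list = ["A heart attack is a result of cardiovascular...",
             "Chronic intermittent hypoxia is the..."]

# Call find_word once per sentence, then keep only the non-empty results
per_sentence = (find_word(list_of_matches, line.lower()) for line in text_list)
key_vals = [pairs for pairs in per_sentence if pairs]
print(key_vals)
```

Calling the function once per sentence also avoids the duplicate call in the original comprehension.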
- Process the output: key_vals now holds one list of [word, sentence] pairs per sentence, so flatten it into a single list with the following code:
output = []
for word_sentence_list in key_vals:
    for word_sentence in word_sentence_list:
        output.append(word_sentence)
and, finally, the output will be:
output = [
    ['heart attack', 'a heart attack is a result of cardiovascular...'],
    ['cardiovascular', 'a heart attack is a result of cardiovascular...'],
    ['hypoxia', 'chronic intermittent hypoxia is the...']
]
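The two steps could also be collapsed into one pass. The following is only a sketch: find_all_words is a hypothetical name, and it assumes plain substring matching against the lowercased sentence, as in the question.

```python
def find_all_words(list_of_matches, text_list):
    """Return a flat list of [word, sentence] pairs, one per match."""
    return [
        [word, line.lower()]
        for line in text_list          # outer loop: each sentence
        for word in list_of_matches    # inner loop: each term
        if word in line.lower()        # keep only terms present in the sentence
    ]

list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
text_list = ["A heart attack is a result of cardiovascular...",
             "Chronic intermittent hypoxia is the..."]

output = find_all_words(list_of_matches, text_list)
for pair in output:
    print(pair)
```

Because the comprehension emits pairs directly, there is no nested list to flatten and sentences without matches contribute nothing.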