I have a list, file_contents, whose elements are lists of word tokens. I would like to write a function that takes two word strings and an integer size as its input and returns every token list in which the two words appear within size words of each other.
So far I have tokenized the words for each element in the list using nltk, but I'm not sure where to go from here. Could anyone point me to an nltk method or Python code that could do this?
Here is what the function should look like:
file_contents = [['man', 'once', 'upon', 'time', 'love', 'princess'], ['python', 'code', 'cool', 'uses', 'java'], ['man', 'help', 'test', 'weird', 'love'], ...]
def check_words_within(word1: str, word2: str, size: int) -> list:
    # how to implement?
check_words_within('man','love', 4)
would return [['man', 'once', 'upon', 'time', 'love'], ['man', 'help', 'test', 'weird', 'love']]
check_words_within('man','upon', 1)
would return [['man', 'once', 'upon']]
check_words_within('man','document',4)
would return []
Does nltk have a function to help me do this?
CodePudding user response:
Create a list of dicts to look up word positions from.
dat = [{word: idx for idx, word in enumerate(el)} for el in file_contents]
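To see the intermediate structure this builds, here is the comprehension applied to the first sample list; each dict maps a token to its index in that list:

```python
tokens = ['man', 'once', 'upon', 'time', 'love', 'princess']
lookup = {word: idx for idx, word in enumerate(tokens)}
print(lookup)
# {'man': 0, 'once': 1, 'upon': 2, 'time': 3, 'love': 4, 'princess': 5}
```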
def foo(w1, w2, dist, f, fdat):
    arr = []
    for i, v in enumerate(fdat):
        i1 = v.get(w1)  # index of w1 in list i, or None if absent
        i2 = v.get(w2)  # index of w2 in list i, or None if absent
        if (i1 is not None) and (i2 is not None) and (i2 - i1 <= dist + 1):
            arr.append(f[i][i1:i2 + 1])
    return arr
foo("man", "upon", 1, file_contents, dat)
# [['man', 'once', 'upon']]
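Note that the dict lookup keeps only the last position of a repeated token and assumes w1 appears before w2. A sketch that scans every position of both words instead (the function name matches the question; the inclusive size + 1 window mirrors the answer above) could look like this:

```python
def check_words_within(word1, word2, size, contents):
    """Return each token span where word1 and word2 sit within `size` words."""
    results = []
    for tokens in contents:
        # record every position of both words, so repeated tokens are covered
        pos1 = [i for i, t in enumerate(tokens) if t == word1]
        pos2 = [i for i, t in enumerate(tokens) if t == word2]
        for i1 in pos1:
            for i2 in pos2:
                lo, hi = min(i1, i2), max(i1, i2)  # order-independent
                if hi - lo <= size + 1:  # same window as the answer above
                    results.append(tokens[lo:hi + 1])
    return results

file_contents = [['man', 'once', 'upon', 'time', 'love', 'princess'],
                 ['python', 'code', 'cool', 'uses', 'java'],
                 ['man', 'help', 'test', 'weird', 'love']]
print(check_words_within('man', 'love', 4, file_contents))
# [['man', 'once', 'upon', 'time', 'love'], ['man', 'help', 'test', 'weird', 'love']]
```

This passes contents as a parameter rather than relying on a global, and also matches the other two examples from the question.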