I am trying to find the average value of numbers before certain words.
I have a list of sentences:
['I had to wait 30 minutes', 'It took too long had to wait 35 minutes', ...]
I want to find the average of the numbers that come directly before a certain word, which in this case is "minutes".
So this would result in 32.5 minutes, and I want to be able to do this for any input word. I already found which words most often occur after a number, but I did that by changing all numbers to the same value (@) and seeing which words most frequently occur after the @ sign.
I thought I could create bigrams and then look for the number before "minutes", but that does not work right now.
from collections import Counter
from nltk import bigrams

# One lowercase token per row, in sentence order
unigrams = (
    all_data['PreProcess'].str.lower()
    .str.split(expand=True)
    .stack())
bgs = bigrams(unigrams)
# Keep the pairs whose second token is 'minutes'
minute_bgs = filter(lambda item: item[1] == 'minutes', bgs)
# Count the token that comes before 'minutes'
c = Counter(map(lambda item: item[0], minute_bgs))
print(c.most_common(12))
CodePudding user response:
Use str.extractall to get the minutes, convert to numeric and then take the mean...
average = pd.to_numeric(df['PreProcess'].str.extractall(r'(?i)(\d+)\s+minutes').squeeze()).mean()
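Since the question asks for this to work with any input word, here is a minimal sketch of the same idea wrapped in a function. The name average_before and the sample Series are made up for illustration, and it assumes the numbers are whole numbers separated from the word by whitespace. re.escape keeps special characters in the word from being read as regex syntax.

```python
import re
import pandas as pd


def average_before(series: pd.Series, word: str) -> float:
    """Mean of the integers that directly precede `word` (case-insensitive).

    Hypothetical helper; assumes whole numbers followed by whitespace and the word.
    """
    # (?i) makes the match case-insensitive; \b stops 'minutes' matching 'minutest'
    pattern = rf"(?i)(\d+)\s+{re.escape(word)}\b"
    matches = series.str.extractall(pattern)
    if matches.empty:
        # No number was found before the word in any sentence
        return float("nan")
    # Column 0 holds the captured digits as strings
    return pd.to_numeric(matches[0]).mean()


# Sample data taken from the question
sentences = pd.Series([
    "I had to wait 30 minutes",
    "It took too long had to wait 35 minutes",
])
print(average_before(sentences, "minutes"))  # 32.5
```

Unlike squeeze(), selecting column 0 also behaves the same when there is only one match, and extractall picks up every occurrence if a sentence mentions the word more than once.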