Returning the frequency of word list matched to a string

Here is an example of the problem; the result list shows what I am aiming to obtain.

states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."

result = ["Montana", "Montana", "New York"]

One crude approach I am considering is to take the intersection of the two, but it cannot handle duplicates or two-word states like "New York".

state_lower = [x.lower() for x in states]
set(state_lower).intersection(text.lower().split())

I am looking for the fastest way to perform this operation as each text can be very long (4,000 words) and I have millions of texts to go through. Also, I would like to keep the spaces in the original text. Thank you in advance.

CodePudding user response：

Since you are "looking for the fastest way to perform this operation", I suggest trying flashtext. It needs to be installed first, which is done the standard way:

pip install flashtext

A simple usage example with your data:

from flashtext import KeywordProcessor
states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(states)
result = keyword_processor.extract_keywords(text)
print(result)

output

['Montana', 'Montana', 'New York']

If you want to know how flashtext works, read "Replace or Retrieve Keywords In Documents at Scale" by Vikash Singh.
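To give a rough feel for the idea, here is a minimal sketch of trie-based keyword scanning, which is the kind of structure that paper describes. This is not flashtext's actual code: the names `END`, `build_trie` and `extract_keywords` are made up for illustration, and the sketch is case-sensitive and uses a simplistic "alphanumeric character" test for word boundaries.

```python
END = '_end_'  # hypothetical marker key: "a keyword ends at this node"

states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."


def build_trie(keywords):
    """Build a character trie out of nested dicts."""
    root = {}
    for keyword in keywords:
        node = root
        for char in keyword:
            node = node.setdefault(char, {})
        node[END] = keyword  # store the full keyword at its last character
    return root


def extract_keywords(text, trie):
    """Scan the text left to right, collecting the longest keyword at each word start."""
    found = []
    i, n = 0, len(text)
    while i < n:
        # A keyword may only start at the beginning of a word.
        if i > 0 and text[i - 1].isalnum():
            i += 1
            continue
        node, j = trie, i
        match, match_end = None, i
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Remember a keyword only if it also ends at a word boundary.
            if END in node and (j == n or not text[j].isalnum()):
                match, match_end = node[END], j
        if match is None:
            i += 1
        else:
            found.append(match)
            i = match_end  # continue scanning right after the match
    return found


trie = build_trie(states)
result = extract_keywords(text, trie)
print(result)
```

The point of the trie is that the text is scanned character by character against all keywords at once, so the work per text depends mainly on the length of the text rather than on the number of keywords.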

CodePudding user response：

The simplest, in my opinion, is to use a well-crafted regex:

import re

regex = '|'.join(map(re.escape, states))

out = re.findall(regex, text)

output: ['Montana', 'Montana', 'New York']

If you want to count:

import re
from collections import Counter

regex = '|'.join(map(re.escape, states))

out = Counter((m.group() for m in re.finditer(regex, text)))
print(dict(out))

output:

{'Montana': 2, 'New York': 1}
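Two things a plain alternation does not handle are partial-word hits (for example "Montana" inside "Montanans") and differences in case. The following is a hedged variant, not a drop-in replacement: it assumes you want whole-word, case-insensitive matches reported with the spelling from `states`. The names `pattern`, `canonical` and `find_states` are illustrative only.

```python
import re
from collections import Counter

states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."

# Try longer names first, so a longer name is attempted before any
# shorter name it might start with.
ordered = sorted(states, key=len, reverse=True)

# (?<!\w) and (?!\w) behave like word boundaries but, unlike \b,
# also work for names that end in punctuation such as "Washington D.C.".
# Compile once and reuse the pattern for every text.
pattern = re.compile(
    r'(?<!\w)(?:' + '|'.join(map(re.escape, ordered)) + r')(?!\w)',
    re.IGNORECASE,
)

# Map the lower-cased match back to the spelling used in `states`.
canonical = {state.lower(): state for state in states}


def find_states(text):
    """Return the matched states in order of appearance."""
    return [canonical[m.lower()] for m in pattern.findall(text)]


out = find_states(text)
print(out)
print(dict(Counter(out)))
```

Since the pattern is compiled only once, the same `find_states` function can be applied to each of the texts in turn.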

CodePudding user response：

states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
for x in states:
    print(text.count(x), x)

output:

2 Montana
1 New York
0 Iowa
0 Alabama
0 Washington D.C.

CodePudding user response：

Simply loop over all the strings in your states list and append the matches to a new list.

states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."

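new_list = []
text_lower = text.lower()  # lower-case the text once, not once per state

for state in states:
    if state.lower() in text_lower:
        # Count case-insensitively, consistent with the membership test above.
        for _ in range(text_lower.count(state.lower())):
            new_list.append(state)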

print(new_list)

outputs:

['Montana', 'Montana', 'New York']

CodePudding user response：

You could maybe create a little function for it.

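wanted_countries = ['Montana', 'New York', 'Iowa', 'Alabama', 'Washington D.C.']

def filter_non_wanted(text):
   return_list = list()
   # Split into words and drop trailing punctuation so that "York." matches "York".
   # Caveat: this also turns "D.C." into "D.C", so "Washington D.C." is not matched.
   my_list = [word.rstrip('.,;:!?') for word in text.split(' ')]

   for INDEX, i in enumerate(my_list):
      # Next word, or '' for the last word (avoids an IndexError).
      next_word = my_list[INDEX + 1] if INDEX + 1 < len(my_list) else ''

      #-- first possibility: hard-code the two-word name --#
      # if i == 'New' and next_word == 'York':
      #    return_list.append('New York')
      # elif i in wanted_countries:
      #    return_list.append(i)

      #-- second possibility, more flexible --#
      if ' '.join([i, next_word]) in wanted_countries:
         return_list.append(' '.join([i, next_word]))
      elif i in wanted_countries:
         return_list.append(i)

   return return_list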

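The second one is more "flexible" because you can simply add another if for longer names, for example

if ' '.join([i] + my_list[start:end + 1]) in wanted_countries:

where my_list is the list of words and start and end are the (hypothetical) indices of the first and last following word to include.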
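Instead of adding one if per name length, the same idea can be written as a loop over window sizes. This is only a sketch under some assumptions: `find_names` and `norm` are hypothetical helper names, words are split on whitespace, and punctuation is stripped from both ends of every word in the text and in the names so that "York." and "D.C." still match.

```python
import string

states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."


def norm(token):
    """Strip punctuation from both ends of a single word."""
    return token.strip(string.punctuation)


def find_names(text, names):
    """Return every name found in the text, in order of appearance."""
    # Normalised name -> original spelling, e.g. "Washington D.C" -> "Washington D.C."
    lookup = {' '.join(norm(part) for part in name.split()): name for name in names}
    max_len = max((len(key.split()) for key in lookup), default=0)
    words = [norm(word) for word in text.split()]

    found = []
    i = 0
    while i < len(words):
        # Try the longest window first, then shorter ones.
        for size in range(min(max_len, len(words) - i), 0, -1):
            candidate = ' '.join(words[i:i + size])
            if candidate in lookup:
                found.append(lookup[candidate])
                i += size  # skip the words that were just matched
                break
        else:
            i += 1  # no name starts at this word
    return found


result = find_names(text, states)
print(result)
```

Because the names are kept in a dict, each window costs one lookup, so the running time grows with the number of words in the text and the longest name, not with the number of names.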