Returning the frequency of word list matched to a string

Time:11-09

Here is an example of the problem, with the result list showing what I am aiming to obtain.

states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."

result = ["Montana", "Montana", "New York"]

One crude approach is to take the set intersection of the two, but that cannot handle duplicates or two-word states like "New York".

state_lower = [x.lower() for x in states]
set(state_lower).intersection(text.lower().split())

I am looking for the fastest way to perform this operation as each text can be very long (4,000 words) and I have millions of texts to go through. Also, I would like to keep the spaces in the original text. Thank you in advance.

CodePudding user response:

looking for the fastest way to perform this operation

Given that, I suggest trying flashtext. It needs to be installed first, which is done the standard way:

pip install flashtext

Simple usage example with your data

from flashtext import KeywordProcessor
states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(states)
result = keyword_processor.extract_keywords(text)
print(result)

output

['Montana', 'Montana', 'New York']

If you want to know how flashtext works, read "Replace or Retrieve Keywords In Documents at Scale" by Vikash Singh.
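For a rough intuition of why this kind of lookup scales, here is a minimal sketch of the general idea: store the keywords in a trie and scan the text once, so the cost per text does not grow with the number of keywords. This is a simplified illustration written for this answer, not flashtext's actual code; `build_trie`, `extract_keywords` and the `"_end_"` marker are hypothetical names.

```python
def build_trie(keywords):
    """Build a character trie; the "_end_" key stores the original spelling."""
    trie = {}
    for keyword in keywords:
        node = trie
        for char in keyword.lower():
            node = node.setdefault(char, {})
        node["_end_"] = keyword
    return trie


def extract_keywords(text, trie):
    """Return keywords in order of appearance, scanning the text left to right."""
    found = []
    lowered = text.lower()
    n = len(lowered)
    i = 0
    while i < n:
        # Only start a match at the beginning of a word.
        if i > 0 and lowered[i - 1].isalnum():
            i += 1
            continue
        node, j = trie, i
        match, match_end = None, i
        while j < n and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            # Accept a keyword only if it ends at a word boundary;
            # keep going so that the longest keyword wins.
            if "_end_" in node and (j == n or not lowered[j].isalnum()):
                match, match_end = node["_end_"], j
        if match is not None:
            found.append(match)
            i = match_end  # resume after the matched keyword
        else:
            i += 1
    return found


states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
trie = build_trie(states)
print(extract_keywords(text, trie))  # ['Montana', 'Montana', 'New York']
```

The trie is built once and reused for every text, which is what matters when there are millions of texts to process.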

CodePudding user response:

The simplest, in my opinion, is to use a well-crafted regex:

import re

regex = '|'.join(map(re.escape, states))

out = re.findall(regex, text)

output: ['Montana', 'Montana', 'New York']

If you want to count:

import re
from collections import Counter

regex = '|'.join(map(re.escape, states))

out = Counter((m.group() for m in re.finditer(regex, text)))
print(dict(out))

output:

{'Montana': 2, 'New York': 1}
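With millions of texts it may be worth compiling the pattern once and tightening it a little. The sketch below is one possible variant, under some assumptions the question does not state: matching should be case-insensitive, a name should not match inside a longer word (e.g. "Montanans"), and longer names should win over shorter ones. `find_states` and `canonical` are names made up for this example. Lookarounds are used instead of `\b` because `\b` does not behave as wanted after a name that ends in a period, such as "Washington D.C.".

```python
import re
from collections import Counter

states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]

# Longest names first, so a longer name is tried before a shorter one
# that happens to be its prefix.
alternatives = sorted(states, key=len, reverse=True)

# Compile once and reuse for every text.
pattern = re.compile(
    r"(?<!\w)(?:" + "|".join(map(re.escape, alternatives)) + r")(?!\w)",
    re.IGNORECASE,
)

# Map lowercased matches back to the spelling used in the list.
canonical = {state.lower(): state for state in states}


def find_states(text):
    """Return the matched state names in order of appearance."""
    return [canonical[m.group().lower()] for m in pattern.finditer(text)]


text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
print(find_states(text))                 # ['Montana', 'Montana', 'New York']
print(dict(Counter(find_states(text))))  # {'Montana': 2, 'New York': 1}
```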

CodePudding user response:

states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
for x in states:
    print(text.count(x), x)

output:

2 Montana
1 New York
0 Iowa
0 Alabama
0 Washington D.C.
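One caveat with `str.count`: it counts substrings, so a name that is contained in another name is counted twice. The short example below illustrates this with a hypothetical list that contains both "Washington" and "Washington D.C." (the former is not in the question's list and is added only for illustration).

```python
# Hypothetical list: "Washington" is added only to show the overlap.
states = ["Washington", "Washington D.C."]
text = "Washington D.C. is not in Washington."

# Count each name as a plain substring of the text.
counts = {state: text.count(state) for state in states}
print(counts)  # {'Washington': 2, 'Washington D.C.': 1}
```

"Washington" is reported twice even though only one occurrence refers to the state, so this approach is safe only when no name is a substring of another.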

CodePudding user response:

Simply loop over all the strings in your states list and append the matches to a new list.

states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."

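new_list = []
lowered = text.lower()

for state in states:
    if state.lower() in lowered:
        # count in the lowercased text so the count agrees with the test above
        for i in range(lowered.count(state.lower())):
            new_list.append(state)

print(new_list)

outputs:

['Montana', 'Montana', 'New York']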

CodePudding user response:

You could maybe create a little function for it.

wanted_countries = ['Montana', 'New York', 'Iowa', 'Alabama', 'Washington D.C.']

def filter_non_wanted(text):
   return_list = list()
   words = text.split(' ')

   for index, word in enumerate(words):
      # join the current word with the next one to catch two-word names
      pair = ' '.join(words[index:index + 2])

      # try the pair first, then the single word; each one is also tried
      # without trailing punctuation, so that "York." still matches "York"
      for candidate in (pair, pair.rstrip('.,;:!?'), word, word.rstrip('.,;:!?')):
         if candidate in wanted_countries:
            return_list.append(candidate)
            break

   return return_list

Joining adjacent words is more flexible than hard-coding a check for 'New' followed by 'York': to support three-word names, add another candidate built from a longer slice, such as ' '.join(words[index:index + 3]).
