Here is an example of the problem, with the result list showing what I am aiming to obtain.
states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
result = ["Montana", "Montana", "New York"]
One crude approach I am considering is an intersection of the two, but it cannot handle duplicates or two-word states like "New York".
state_lower = [x.lower() for x in states]
set(state_lower).intersection(text.lower().split())
I am looking for the fastest way to perform this operation as each text can be very long (4,000 words) and I have millions of texts to go through. Also, I would like to keep the spaces in the original text. Thank you in advance.
CodePudding user response:
looking for the fastest way to perform this operation
Since speed is the priority, I suggest giving flashtext a try. It is a third-party package, so you need to install it first, which is done in the standard way:
pip install flashtext
Simple usage example with your data
from flashtext import KeywordProcessor
states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(states)
result = keyword_processor.extract_keywords(text)
print(result)
output
['Montana', 'Montana', 'New York']
If you want to know how flashtext works, read "Replace or Retrieve Keywords In Documents at Scale" by Vikash Singh.
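For intuition only, here is a rough sketch of dictionary matching with a character trie, which is the general kind of structure such keyword extractors rely on. It is not flashtext's actual code; `build_trie`, `extract_with_trie` and `is_word_char` are hypothetical names made up for this illustration. The point it tries to show is that the work per text position is bounded by the longest keyword, not by the number of keywords.

```python
END = None  # sentinel key meaning "a keyword ends at this node"; never equal to a character


def build_trie(keywords):
    """Store every keyword character by character in nested dicts."""
    trie = {}
    for keyword in keywords:
        node = trie
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[END] = keyword
    return trie


def is_word_char(ch):
    return ch.isalnum() or ch == "_"


def extract_with_trie(text, trie):
    """Return keywords in order of appearance, preferring the longest match."""
    found = []
    i, n = 0, len(text)
    while i < n:
        # A keyword may only start at the beginning of a word.
        if i > 0 and is_word_char(text[i - 1]):
            i += 1
            continue
        node, j, match, match_end = trie, i, None, i
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Accept only if the keyword also ends at a word boundary.
            if END in node and (j == n or not is_word_char(text[j])):
                match, match_end = node[END], j
        if match is None:
            i += 1
        else:
            found.append(match)
            i = match_end  # continue after the match; original text is untouched
    return found


states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
trie = build_trie(states)
print(extract_with_trie(text, trie))  # ['Montana', 'Montana', 'New York']
```

The trie is built once and can be reused for every text, which matters when there are millions of them.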
CodePudding user response:
The simplest way, in my opinion, is to use a well-crafted regex:
import re
regex = '|'.join(map(re.escape, states))
out = re.findall(regex, text)
output: ['Montana', 'Montana', 'New York']
If you want to count:
import re
from collections import Counter
regex = '|'.join(map(re.escape, states))
out = Counter((m.group() for m in re.finditer(regex, text)))
print(dict(out))
output:
{'Montana': 2, 'New York': 1}
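If partial-word matches are a concern (for example "Montana" inside "Montanans"), the pattern can be tightened a little. The following is a sketch under a few assumptions: `build_state_pattern` and `find_states` are hypothetical helper names, names are sorted longest first so that a longer name wins over a shorter one that is its prefix, and lookarounds are used instead of `\b` because "Washington D.C." ends in a non-word character.

```python
import re

states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]


def build_state_pattern(names):
    # Longest names first, so a longer name is tried before any shorter prefix of it.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(map(re.escape, ordered))
    # (?<!\w) and (?!\w) require that the match is not glued to another word character.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")


# Compile once and reuse the same pattern object for every text.
STATE_PATTERN = build_state_pattern(states)


def find_states(text):
    return STATE_PATTERN.findall(text)


text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
print(find_states(text))  # ['Montana', 'Montana', 'New York']
```

Compiling the pattern a single time avoids rebuilding it for each of the millions of texts; the original text is only read, never modified.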
CodePudding user response:
states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
for x in states:
    print(text.count(x), x)
2 Montana
1 New York
0 Iowa
0 Alabama
0 Washington D.C.
CodePudding user response:
Simply loop over all the strings in your states list and append the matches to a new list.
states = ["Montana", "New York", "Iowa", "Alabama", "Washington D.C."]
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
new_list = []
lowered_text = text.lower()
for state in states:
    # Count case-insensitively and append the state once per occurrence.
    for _ in range(lowered_text.count(state.lower())):
        new_list.append(state)
print(new_list)
outputs:
['Montana', 'Montana', 'New York']
CodePudding user response:
You could maybe create a little function for it.
wanted_countries = ['Montana', 'New York', 'Iowa', 'Alabama', 'Washington D.C.']

def filter_non_wanted(my_list):
    return_list = list()
    words = my_list.split(' ')
    for INDEX, i in enumerate(words):
        # Next word, or '' when i is the last word of the text.
        next_word = words[INDEX + 1] if INDEX + 1 < len(words) else ''
        #-- first possibility, hard-coded for 'New York' --#
        # if i == 'New' and next_word == 'York':
        #     return_list.append('New York')
        # elif i in wanted_countries:
        #     return_list.append(i)
        #-- second possibility, more flexible --#
        # Words are compared verbatim, so trailing punctuation ('York.') prevents a match.
        if ' '.join([i, next_word]) in wanted_countries:
            return_list.append(' '.join([i, next_word]))
        elif i in wanted_countries:
            return_list.append(i)
    return return_list
The second one is more "flexible" because you can simply add a new if for longer names, such as the following, where start and stop are placeholder word indexes:
if ' '.join(my_list.split(' ')[start:stop + 1]) in wanted_countries:
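The idea of adding one if per name length can be generalized to a loop over window sizes. The following is only a sketch of that generalization, not the code from the answer: `match_at` and `filter_wanted` are hypothetical names, the list of wanted names is assumed to be non-empty, and trailing punctuation is stripped as a fallback so that "York." still matches while "Washington D.C." is tried verbatim first.

```python
def match_at(words, start, wanted, max_words):
    """Return (name, number_of_words) for the longest wanted name at start, else None."""
    for size in range(max_words, 0, -1):
        candidate = " ".join(words[start:start + size])
        # Try the words verbatim first, then without trailing punctuation.
        for form in (candidate, candidate.rstrip(".,;:!?")):
            if form in wanted:
                return form, size
    return None


def filter_wanted(text, wanted_names):
    wanted = set(wanted_names)  # set lookup instead of scanning a list
    max_words = max(len(name.split(" ")) for name in wanted)
    words = text.split(" ")
    found, i = [], 0
    while i < len(words):
        hit = match_at(words, i, wanted, max_words)
        if hit is None:
            i += 1
        else:
            found.append(hit[0])
            i += hit[1]  # skip the words consumed by the match
    return found


wanted_countries = ['Montana', 'New York', 'Iowa', 'Alabama', 'Washington D.C.']
text = "Montana is big sky country where great ski slopes can be found. Avid skiers will enjoy Montana more than New York."
print(filter_wanted(text, wanted_countries))  # ['Montana', 'Montana', 'New York']
```

Because the window size is derived from the longest name in the list, names with three or more words need no extra branches.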