Let me start with an example. Consider the following list in Python:
cities = [
'New york',
'San francisco',
'California',
'Las vegas',
'Chicago',
'Miami'
]
I also have the following sentences.
sentences = [
"Both of us were new to New York City, and had few or no friends.",
"Win three more games and he becomes king of San Francisco.",
"Uncurling from the couch, she started to the bedroom of her father's small Miami apartment."
]
For each sentence, I want to find the substring of that sentence that is closest to any string present in the list. So, in this example, for each sentence in sentences I want to get the longest substring that is closest to any string in the list cities. Therefore, in this case, my result should look like:
desired_result = [
'New york',
'San francisco',
'Miami'
]
There are a few algorithms I have in mind, but none of them is ideal.
Algorithm 1
One algorithm gives very good results, but it is very bad in terms of time complexity. I extract all sub-phrases of a sentence: for a sentence with n tokens, I start from the n-word sub-phrase and go down to one-word sub-phrases. Then I use the difflib.get_close_matches function to detect any sub-phrase that is closest to any string in the list of cities. However, as we can clearly see, the complexity is very high: a sentence of length n has O(n*n) sub-phrases in total. Also, the list of cities is not small; in my real-world use case this list contains around 7 million strings.
In this case, my code looks like this:
from difflib import get_close_matches
from typing import List

def generate_subphrases(sen):
    subphrases: List[str] = []
    # My logic to generate all possible subphrases
    # .
    # .
    # .
    return subphrases

result = []
for sen in sentences:
    subphrases = generate_subphrases(sen)
    ans = None
    for phrase in subphrases:
        # Keep the first sub-phrase that has a close match; since sub-phrases
        # are generated from longest to shortest, this is also the longest one.
        if get_close_matches(phrase, cities):
            ans = phrase
            break
    result.append(ans)
print(result)
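For reference, here is one way generate_subphrases could be written. This is only a sketch, assuming plain whitespace tokenization and emitting longer sub-phrases first so that the first close match found is also the longest:

from typing import List

def generate_subphrases(sen: str) -> List[str]:
    tokens = sen.split()
    subphrases: List[str] = []
    # Every contiguous span of tokens, longest spans first.
    for size in range(len(tokens), 0, -1):
        for start in range(len(tokens) - size + 1):
            subphrases.append(" ".join(tokens[start:start + size]))
    return subphrases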
Algorithm 2
This is a bit faster than the previous approach; however, the results are not as good. The benefit of the previous approach was that it could tolerate a few mismatches: for example, New York would be detected even if the cities list only contained New york. In this approach we cannot tolerate even a single character mismatch, whereas in my use case I can tolerate an error of up to 30-35% in terms of character mismatches. Here I form a huge regular expression as the union of all cities in the list, then use re.search to look for the sub-phrase in my sentence. In my opinion, this is faster but not very good.
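As a rough sketch of this approach (re.IGNORECASE papers over case differences, but any other character mismatch still fails, which is exactly the limitation described above; sorting longer names first is just so the longest alternative wins when several cities start at the same position):

import re

# One big alternation over all cities; escape each name in case it
# contains regex metacharacters.
city_pattern = re.compile(
    "|".join(re.escape(c) for c in sorted(cities, key=len, reverse=True)),
    re.IGNORECASE,
)

result = []
for sen in sentences:
    m = city_pattern.search(sen)
    result.append(m.group(0) if m else None)
print(result)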
I want to know whether I can use any data structure to achieve this task, or any Python utility function similar to difflib.get_close_matches that allows searching over an entire sentence.
Update
My ultimate goal is to make Algorithm 1 more efficient, maybe using some string algorithm I might not be familiar with, or some data structure. I also thought of the `Trie` data structure once, but again, that helps with exact matching, not the soft matching that the Python utility function described in Algorithm 1 provides.
Note: I'm not doing an NER task in this case. The example provided was just to illustrate things easily. For this reason, I cannot use pre-trained machine learning models like spaCy or NLTK to identify cities. The overall goal is not to identify cities, but to identify the sub-phrase in a string that is closest to any string from a list of strings.
CodePudding user response:
If you can create the possible matching strings before running the algorithm, pyahocorasick would be a perfect solution for your use case, since it precomputes a trie with all the cities you're trying to match.
The downside is that you'd need to provide the variations / possible character-mismatch patterns yourself.
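A minimal sketch of that idea, assuming the variations you want to tolerate are generated up front and added as extra keys (Aho-Corasick itself only does exact matching):

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for city in cities:
    # Keys are matched literally, so add every variation you want to
    # tolerate (lowercased forms, common misspellings, ...) as its own key.
    automaton.add_word(city.lower(), city)
automaton.make_automaton()

for sen in sentences:
    # iter() yields (end_index, value) for every key occurring in the text.
    matches = list(automaton.iter(sen.lower()))
    if matches:
        # Keep the longest matched city for this sentence.
        print(max(matches, key=lambda m: len(m[1]))[1])
    else:
        print(None)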
For your naive Algorithm 1, I'd recommend only generating sub-phrases of up to M words, where M is the number of words in the longest string in your list. (There's no point trying to match a 10-word sub-sentence against strings that are only up to 3 words long, for example.) This should at least help speed things up.
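That cap could be folded directly into your generate_subphrases, along these lines (a sketch only, assuming whitespace tokenization and M measured in words):

from typing import List

# No sub-phrase longer than the longest city name (in words) can match.
max_words = max(len(city.split()) for city in cities)

def generate_subphrases(sen: str, limit: int = max_words) -> List[str]:
    tokens = sen.split()
    subphrases: List[str] = []
    for size in range(min(limit, len(tokens)), 0, -1):
        for start in range(len(tokens) - size + 1):
            subphrases.append(" ".join(tokens[start:start + size]))
    return subphrases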