Home > database >  Find the longest common substring suffix between a string and a key in a dictionary
Find the longest common substring suffix between a string and a key in a dictionary

Time:12-03

i'm trying to make a program that ouputs the longest common suffix string between a string and a key inside a dictionary.

Easy example: The dictionary has about 6000 key:value pairs so i won't include the whole dictionary. For information the key length are from 2 up to 7 characters.

codeCountry = {
    'AFHAS': 'AFGHANISTAN',
    'AXUYFF': 'ÅLAND ISLANDS',
    'ALUU': 'ALBANIA',
    'DZBG': 'ALGERIA',
    'ASSQ': 'AMERICAN SAMOA',
    'ADDD': 'ANDORRA',
    'ANGO': 'ANGOLA',
    'ANGI': 'ANGUILLA',
    'AQ': 'ANTARCTICA',
    'AG': 'ANTIGUA AND BARBUDA',
    'AMENI': 'ARMENIA',
    'AURI': 'ARUBA',
    'AUR': 'ARGENTINA',
    'AURII': 'AUSTRALIA'
     ...

}

As string i will take "AMAURI" as example so it's more clear (the string is generated randomly and has variable length from one character up to 16, but it always contains one of the suffixes (keys) from the dictionary):

strToUse = "AMAURI"

Expected Result: "ARUBA" because the longest common suffix between the string and the keys in the dictionary is "AURI" so -> "AURI":"ARUBA".

How can i do this is python? I tried something like this (I'm new to python):

for country in codeCountry:
 if country in strToUse:
   print(codeCountry.get(country))

But this prints me "ARGENTINA" which isn't correct, i don't understand why exactly. There are similar problems here on stackoverflow but my problem is different in the sense that it looks for the suffix and not just any character inside the string. I hope i was clear, i'm really confused myself and don't know how to do it, can anybody help me please? Or atleast point me in the right direction?

CodePudding user response:

You can sort the keys by length first and then check them

strToUse = "AMAURI"
for country in sorted(codeCountry.keys(),key=len,reverse=True):
    if country in strToUse:
        print(codeCountry.get(country))
        break
ARUBA

CodePudding user response:

Try out the code below and see if it works for you. stringSubsets() returns a set of all possible keys (country codes) that could be constructed from your input string ("AMAURI" in your example). Set intersection is then used on the codeCountry dict to provide all the keys that match a substring in the set returned by stringSubsets(). The print statement in the last line shows how you extract the value of the largest matching key, or return None if not key matches to avoid a key error.

If for some reason your input strings (in this case "AMAURI") are excessively long and you need to speed your code up, then you might be able to use something more advanced like the Aho Corasick algorithm. If you go this route, you might be able to invert your methodology and actually search your input string for the longest key in your dict (vs searching dict for a substring). This could work well because your codeCountry dict probably won't change often, so the trie that Aho Corasick uses to function could be built ahead of time using your dict keys, making your search on the input string very fast.

codeCountry = {
'AFHAS': 'AFGHANISTAN',
'AXUYFF': 'ÅLAND ISLANDS',
'ALUU': 'ALBANIA',
'DZBG': 'ALGERIA',
'ASSQ': 'AMERICAN SAMOA',
'ADDD': 'ANDORRA',
'ANGO': 'ANGOLA',
'ANGI': 'ANGUILLA',
'AQ': 'ANTARCTICA',
'AG': 'ANTIGUA AND BARBUDA',
'AMENI': 'ARMENIA',
'AURI': 'ARUBA',
'AUR': 'ARGENTINA',
'AURII': 'AUSTRALIA'
}

def stringSubsets(s):
    out = set()
    for i in range(len(s)):
        for j in range(i 1, len(s) 1):
            out.add(s[i:j])

    return out

code = "AMAURI"
candidates = stringSubsets(code)
keys = candidates.intersection(codeCountry)

# results in None if no substring matches a key in dict, else give the  
# value of the longest matching key
print(None if not keys else codeCountry[max(keys)])
  • Related