Replace two letter state code with full name in string

I have a list of ~60K strings, all looking like this:

strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']

I also have a dict of lookups. For an MRE:

lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}

What I would like to do is loop through my list of strings and replace the two-letter state code with the corresponding value in the dictionary, so that the resulting list looks like:

new_strings = ['corpus christi texas', 'san angelo', 'oklahoma city oklahoma', 'abilenesweetwater']

I have seen many similar questions where the two-letter state code (or full state name) is a column of a pd.DataFrame, but none where it is part of a standalone string. I am assuming I will need a regex.

I have tried the following:

print("Test", 'corpus christi tx')
new_test_str = re.sub(r'[\s tx \s]', 'texas', 'corpus christi tx')
print("Reply", new_test_str)

Which (incorrectly) yields:

Test corpus christi tx
Reply corpustexaschristexasitexastexastexas

CodePudding user response：

You can build a regex from the dictionary keys that matches them as whole words, then use each match to look up its replacement in the dictionary:

import re
strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']
lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}
# Escape the keys so any regex metacharacters in them are matched literally
rx = re.compile(fr'\b(?:{"|".join(map(re.escape, lookup))})\b')
strings = [rx.sub(lambda x: lookup[x.group()], s) for s in strings]

Output:

>>> strings
['corpus christi texas', 'san angelo', 'oklahoma city oklahoma', 'abilenesweetwater']
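
As a quick check of why the `\b` word boundaries matter, here is a minimal sketch that wraps the same idea in a helper (the name `expand_states` is hypothetical, not part of the answer above). It assumes the input is already lowercase, as in the question, and shows that a code embedded in a longer word, such as the `ok` in `oklahoma`, is left alone:

```python
import re

lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}

# Alternation of all keys, wrapped in word boundaries so only whole words match.
rx = re.compile(r'\b(?:' + '|'.join(map(re.escape, lookup)) + r')\b')

def expand_states(s):
    """Hypothetical helper: replace whole-word state codes with full names."""
    return rx.sub(lambda m: lookup[m.group()], s)

print(expand_states('oklahoma city ok'))  # the 'ok' inside 'oklahoma' is untouched
print(expand_states('ny to nj'))          # every whole-word code is replaced
```

One caveat with this approach: any standalone two-letter word that happens to be a key is replaced wherever it appears, so with a full state table, ordinary words such as "in", "or", "me" and "hi" would be expanded too.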


CodePudding user response：

Here is another approach. Assuming the state code is always at the end of the string, you can write a regex pattern that matches a trailing two-character code and replace it with the value from your lookup dictionary.

import re
strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']
lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}
_patt = re.compile(r'\s(\w{2})$')

# Substitute only the trailing match; codes missing from lookup are left unchanged
strings = [_patt.sub(lambda m: ' ' + lookup.get(m.group(1), m.group(1)), s) for s in strings]

Output:

['corpus christi texas', 'san angelo', 'oklahoma city oklahoma', 'abilenesweetwater']

The same logic written as an explicit loop may be easier to follow:

for each in strings:
    match = _patt.search(each)
    if match:
        code = match.group(1)
        # Splice the full name in where the trailing code was matched
        each = each[:match.start(1)] + lookup.get(code, code) + each[match.end(1):]
    print(each)
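
If the codes really do appear only at the end of the string, one alternative sketch is to build the pattern from the dictionary keys and anchor it with `$`. That way only known codes can match, and each string needs a single pass. The helper name `expand_trailing_state` is made up for illustration, and the code assumes lowercase input as in the question:

```python
import re

lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}

# A known code that is preceded by whitespace and sits at the end of the string.
trailing = re.compile(r'(?<=\s)(?:' + '|'.join(map(re.escape, lookup)) + r')$')

def expand_trailing_state(s):
    """Hypothetical helper: expand a state code only when it ends the string."""
    return trailing.sub(lambda m: lookup[m.group()], s)

strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']
print([expand_trailing_state(s) for s in strings])
```

Because the lookbehind `(?<=\s)` does not consume the whitespace, the replacement function only has to return the full name, and a code at the start or in the middle of a string is not touched.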