Replace two letter state code with full name in string - Python 3.8.x

Time:03-18

I have a list of ~60K strings, all looking like this:

strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']

I also have a dict of lookups. For an MRE:

lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}

What I would like to do is loop through my list of strings and replace the two-letter state code with the corresponding value in the dictionary, so that the resulting list looks like:

new_strings = ['corpus christi texas', 'san angelo', 'oklahoma city oklahoma', 'abilenesweetwater']

I have seen many similar questions where the two-letter state code (or full state name) is a column of a pd.DataFrame, but none where it is part of a free-standing string. I assume I will need a regex.

I have tried the following:

import re
print("Test", 'corpus christi tx')
new_test_str = re.sub(r'[\s tx \s]', 'texas', 'corpus christi tx')
print("Reply", new_test_str)

Which (incorrectly) yields:

Test corpus christi tx
Reply corpustexaschristexasitexastexastexas

CodePudding user response:

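Your pattern is a character class ([...]), so it matches any single whitespace, t, or x character rather than the word tx. Instead, you can build a regex from the dictionary keys that matches them as whole words, then replace each match with the corresponding dictionary value: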

import re
strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']
lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}
# Alternation of all keys, wrapped in word boundaries so only whole words match
rx = re.compile(fr'\b(?:{"|".join(lookup)})\b')
# The lambda receives each match object and returns the full name for the matched code
strings = [rx.sub(lambda x: lookup[x.group()], s) for s in strings]

Output:

>>> strings
['corpus christi texas', 'san angelo', 'oklahoma city oklahoma', 'abilenesweetwater']
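
If the lookup keys could ever contain regex metacharacters, or if one key is a prefix of another, the same idea can be hardened a little. The following is only a minimal sketch under those assumptions, not part of the answer above; the function name expand_codes is made up for illustration. It escapes every key with re.escape and sorts the keys longest-first before joining them into the alternation.

```python
import re

def expand_codes(strings, lookup):
    """Replace whole-word lookup keys in each string (hypothetical helper)."""
    if not lookup:
        # An empty alternation would match everywhere, so return a plain copy.
        return list(strings)
    # Longest keys first, so a longer key is tried before any shorter prefix of it.
    keys = sorted(lookup, key=len, reverse=True)
    # re.escape protects keys that contain characters such as '.' or '+'.
    rx = re.compile(r'\b(?:' + '|'.join(map(re.escape, keys)) + r')\b')
    return [rx.sub(lambda m: lookup[m.group()], s) for s in strings]

strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']
lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}
new_strings = expand_codes(strings, lookup)
print(new_strings)
```

Note that this replaces a code wherever it appears as a whole word, not only at the end of the string.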


CodePudding user response:

Here is another approach. Assuming the state code is always at the end of the string, you can write a regex pattern that finds a trailing two-character word and replace it with the value from your lookup dictionary.

import re
strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']
lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}
# A two-character word at the very end of the string, preceded by whitespace
_patt = re.compile(r'(?<=\s)(\w{2})$')

# Substitute only the trailing match; codes missing from lookup are left as-is
new_strings = [_patt.sub(lambda m: lookup.get(m.group(1), m.group(1)), s) for s in strings]

Output:

['corpus christi texas', 'san angelo', 'oklahoma city oklahoma', 'abilenesweetwater']

To understand this code in detail, here is a more explicit version of the same logic:

for each in strings:
    match = _patt.search(each)
    if match:
        code = match.group(1)
        # Swap only the matched trailing code, not every occurrence of it
        each = each[:match.start(1)] + lookup.get(code, code)
    print(each)
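
If the state code is always the last space-separated word, a regex is not strictly needed. As a rough sketch of that alternative (a different technique from the regex above, and assuming words are separated by a single plain space), str.rpartition can split off the last word so it can be checked against the dictionary. The helper name expand_trailing_code is hypothetical.

```python
def expand_trailing_code(s, lookup):
    """Expand a trailing state code using str.rpartition (hypothetical helper)."""
    # Split at the last space: head is everything before it, last is the final word.
    head, sep, last = s.rpartition(' ')
    # sep is empty when the string has no space at all; leave such strings alone.
    if sep and last in lookup:
        return head + sep + lookup[last]
    return s

strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']
lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}
new_strings = [expand_trailing_code(s, lookup) for s in strings]
print(new_strings)
```

Because only the final word is looked up, an 'ok' inside 'oklahoma city ok' is never touched, and words that are not in the dictionary are returned unchanged.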