I have a list of ~60K strings, all looking like this:
strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']
I also have a dict of lookups. For an MRE:
lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}
What I would like to do is loop through my list of strings and replace the two-letter state code with the value in the dictionary, making the resulting list look like:
new_strings = ['corpus christi texas', 'san angelo', 'oklahoma city oklahoma', 'abilenesweetwater']
I have seen many similar questions that do this where the two-letter state code (or full state name) is a column of a pd.DataFrame, but not where it is part of an independent string. I am assuming I will need a regex.
I have tried the following:
print("Test", 'corpus christi tx')
new_test_str = re.sub(r'[\s tx \s]', 'texas', 'corpus christi tx')
print("Reply", new_test_str)
Which (incorrectly) yields:
Test corpus christi tx
Reply corpustexaschristexasitexastexastexas
CodePudding user response:
You can build a regex from the dictionary keys that matches them as whole words, then look up each match in the dictionary and substitute the corresponding value:
import re
strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']
lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}
rx = re.compile(fr'\b(?:{"|".join([key for key in lookup])})\b')
strings = [rx.sub(lambda x: lookup[x.group()], s) for s in strings]
Output:
>>> strings
['corpus christi texas', 'san angelo', 'oklahoma city oklahoma', 'abilenesweetwater']
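As a side note on why the attempt in the question misfires: square brackets define a character class, so `[\s tx \s]` matches any single whitespace character, `t`, or `x`, not the word `tx`. The sketch below contrasts that with an alternation of keys between word boundaries. The `re.escape` call and the helper name `expand_codes` are additions for illustration and are not part of the answer above; escaping only matters if a key ever contains regex metacharacters.

```python
import re

lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}

# Character class: every single space, 't', or 'x' is replaced on its own.
char_class = re.sub(r'[\s tx \s]', 'texas', 'corpus christi tx')

# Alternation of escaped keys between word boundaries: whole words only.
rx = re.compile(r'\b(?:' + '|'.join(map(re.escape, lookup)) + r')\b')

def expand_codes(s):
    """Replace every whole-word key of `lookup` found in s with its value."""
    return rx.sub(lambda m: lookup[m.group()], s)

whole_word = expand_codes('corpus christi tx')
print(char_class)  # corpustexaschristexasitexastexastexas
print(whole_word)  # corpus christi texas
```

Because `\b` requires a word boundary on both sides, the `ok` at the start of `oklahoma` is left alone.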
CodePudding user response:
Here is another approach. Assuming that the state code is always at the end of the string, you can write a regex pattern that finds such a trailing code and replace it with the value from your lookup dictionary.
import re
strings = ['corpus christi tx', 'san angelo', 'oklahoma city ok', 'abilenesweetwater']
lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}
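_patt = re.compile(r'\s(\w{2})$')  # whitespace, then exactly two word characters at the end
# Slice at the match position rather than calling re.sub with the code as a pattern,
# which would also rewrite the 'ok' inside 'oklahoma'. Codes missing from the lookup
# are left unchanged. The := operator requires Python 3.8+.
strings = [s[:m.start(1)] + lookup.get(m.group(1), m.group(1)) if (m := _patt.search(s)) else s for s in strings]
Output:
['corpus christi texas', 'san angelo', 'oklahoma city oklahoma', 'abilenesweetwater']
The same logic written as an explicit loop, which may be easier to follow:
for each in strings:
    match = _patt.search(each)
    if match:
        code = match.group(1)
        # Keep everything before the code, then append its expansion.
        each = each[:match.start(1)] + lookup.get(code, code)
    print(each)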
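If it helps, the end-anchored idea can also be packaged as a small function. This is only a sketch: the name `expand_trailing_code` and the extra inputs `'ok corral tx'` and `'fort worth zz'` are made up for illustration. It shows two properties of anchoring at the end of the string: a code elsewhere in the string is not touched, and a trailing two-character word that is not in the dictionary is kept as it is.

```python
import re

lookup = {'tx': 'texas', 'ny': 'new york', 'nj': 'new jersey', 'ok': 'oklahoma'}
_patt = re.compile(r'\s(\w{2})$')

def expand_trailing_code(s):
    """Expand a two-character code at the end of s, if the lookup knows it."""
    m = _patt.search(s)
    if not m:
        return s
    code = m.group(1)
    # Only the text from the start of the matched code onward is rebuilt.
    return s[:m.start(1)] + lookup.get(code, code)

samples = ['oklahoma city ok', 'ok corral tx', 'fort worth zz', 'abilenesweetwater']
results = [expand_trailing_code(s) for s in samples]
print(results)
```

Compared with the whole-word approach, the leading `ok` in `'ok corral tx'` stays as it is here, so which version fits depends on whether codes can appear anywhere in your strings.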