A script that used to work fine for me has started misbehaving, and I can't get to the bottom of it. Any suggestions would be very much appreciated.
I'm replacing strings in a Pandas DataFrame column using the dictionary below. The word boundaries '\b' aren't working as expected for VICTORIA: it is replaced with '\x08VIC\x08' instead of 'VIC'.
statesShort = {'\\bNEW SOUTH WALES\\b': '\\bNSW\\b', '\\bVICTORIA\\b': '\\bVIC\\b',
'\\bQUEENSLAND\\b': '\\bQLD\\b', '\\bWESTERN AUSTRALIA\\b': '\\bWA\\b',
'\\bTASMANIA\\b': '\\bTAS\\b', '\\bNORTHERN TERRITORY\\b': '\\bNT\\b',
'\\bSOUTH AUSTRALIA\\b': '\\bSA\\b', '\\bAUSTRALIAN CAPITAL TERRITORY\\b': '\\bACT\\b'}
test['state'].astype(str).replace(statesShort, regex=True).unique()
Printed result:
array(['WA', 'NT', '\x08VIC\x08', 'SA', 'VIC', 'NSW', 'ACT', 'QLD', 'TAS'], dtype=object)
CodePudding user response:
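You cannot use regex constructs such as word boundaries in the replacement part of re.sub. In a replacement string, \b is not a word boundary; it is parsed as the backspace character, which is where the '\x08' in your output comes from.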
You need to remove all the \\b strings from the replacement values and use:
statesShort = {
r'\bNEW SOUTH WALES\b': 'NSW',
r'\bVICTORIA\b': 'VIC',
r'\bQUEENSLAND\b': 'QLD',
r'\bWESTERN AUSTRALIA\b': 'WA',
r'\bTASMANIA\b': 'TAS',
r'\bNORTHERN TERRITORY\b': 'NT',
r'\bSOUTH AUSTRALIA\b': 'SA',
r'\bAUSTRALIAN CAPITAL TERRITORY\b': 'ACT'
}
Note the use of raw string literals, in which a backslash denotes a literal backslash, so r'\b' reaches the regex engine as the word-boundary escape without any doubling.
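To see the difference in isolation, here is a minimal sketch using only the standard re module rather than pandas. It assumes, as the reference to re.sub above suggests, that the regex replacement ends up going through re.sub; the variable names broken and fixed are just for illustration.

```python
import re

# In the pattern, \b is a zero-width word boundary.
# In the replacement template, \b is parsed as the backspace character (\x08).
broken = re.sub(r'\bVICTORIA\b', r'\bVIC\b', 'VICTORIA')

# With a plain replacement string, only the matched text is substituted.
fixed = re.sub(r'\bVICTORIA\b', 'VIC', 'VICTORIA')

print(repr(broken))  # '\x08VIC\x08'
print(repr(fixed))   # 'VIC'
```

The word boundaries only need to appear on the pattern side, because they decide where a match may start and end; the replacement is just the text that gets written in.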