Regex Word Boundary Issue

Time:06-08

I've run into an unexpected issue with a script that previously worked fine, and I can't get to the bottom of it. Any suggestions would be very much appreciated.

I'm replacing strings in a Pandas DataFrame column using the dictionary below. The word boundaries '\b' aren't working as expected for VICTORIA: it is replaced with '\x08VIC\x08' instead of 'VIC'.

statesShort = {'\\bNEW SOUTH WALES\\b': '\\bNSW\\b', '\\bVICTORIA\\b': '\\bVIC\\b',
 '\\bQUEENSLAND\\b': '\\bQLD\\b', '\\bWESTERN AUSTRALIA\\b': '\\bWA\\b',
 '\\bTASMANIA\\b': '\\bTAS\\b', '\\bNORTHERN TERRITORY\\b': '\\bNT\\b',
 '\\bSOUTH AUSTRALIA\\b': '\\bSA\\b', '\\bAUSTRALIAN CAPITAL TERRITORY\\b': '\\bACT\\b'}

test['state'].astype(str).replace(statesShort, regex=True).unique()

Printed result:

array(['WA', 'NT', '\x08VIC\x08', 'SA', 'VIC', 'NSW', 'ACT', 'QLD', 'TAS'], dtype=object)

CodePudding user response:

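You cannot use regular expressions in the replacement part of re.sub. A replacement string is not a pattern, so \b is not a word boundary there; it is processed as the backspace escape, which is why '\x08' shows up in your output.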
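A quick check with the standard re module shows the effect. This is only a minimal sketch: the sample string is made up for illustration, and it calls re.sub directly rather than going through pandas.

```python
import re

# Sample input, assumed for illustration only.
text = "VICTORIA"

# In the replacement, backslash + 'b' is processed as the backspace
# escape (\x08), not as a word-boundary assertion.
broken = re.sub(r"\bVICTORIA\b", "\\bVIC\\b", text)

# With a plain replacement string, only the pattern carries the boundaries.
fixed = re.sub(r"\bVICTORIA\b", "VIC", text)

print(repr(broken))  # '\x08VIC\x08'
print(repr(fixed))   # 'VIC'
```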

You need to remove all the \\b sequences from the replacements and use

statesShort = {
    r'\bNEW SOUTH WALES\b': 'NSW',
    r'\bVICTORIA\b': 'VIC',
    r'\bQUEENSLAND\b': 'QLD', 
    r'\bWESTERN AUSTRALIA\b': 'WA', 
    r'\bTASMANIA\b': 'TAS', 
    r'\bNORTHERN TERRITORY\b': 'NT',
    r'\bSOUTH AUSTRALIA\b': 'SA', 
    r'\bAUSTRALIAN CAPITAL TERRITORY\b': 'ACT'
}

Note the use of raw string literals, in which a backslash denotes a literal backslash, so r'\b' is the same string as '\\b'.
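If the difference between the literal forms is unclear, the following sketch compares them. The variable names are hypothetical and chosen only for this illustration.

```python
import re

plain = "\bVIC\b"      # '\b' is a single backspace character here
escaped = "\\bVIC\\b"  # backslash + 'b', written with doubled backslashes
raw = r"\bVIC\b"       # raw literal: the same string as `escaped`

print(len(plain))      # 5
print(escaped == raw)  # True

# Only the backslash + 'b' form acts as a word boundary in a pattern.
print(re.search(raw, "VIC") is not None)  # True
print(re.search(plain, "VIC") is None)    # True
```

Both '\\b' and r'\b' work in the pattern keys; the raw form is simply easier to read.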
