How do I change the values in a pandas column that are selected by a regex?-CodePudding

I'm cleaning up data for a personal project and am standardizing the large number of categories. The seemingly low hanging fruit have similar enough names such as:

'SUSPECIOUS CRAFT', 'SUSPECTED MILITANTS', 'SUSPECTED PIRATE','SUSPECTED TERRORISTS', 'SUSPICICIOUS APPROACH', 'SUSPICIOPUS APPROACH', 'SUSPICIOUS APPRAOCH','SUSPICIOUS APPROACH', 'SUSPICIOUS BOAT', 'SUSPICIOUS BOATS', 'SUSPICIOUS CRAFT', 'SUSPICIOUS CRAFTS', 'SUSPICIOUS VESSEL', 'SUSPICOUS APPROACH', 'SUSPICUIOUS APPROACH','SUSPIPICIOUS APPROACH', 'SUSPISIOUC CRAFT', 'SUS[ICIOUS APPROACH'

There are others, including ones with lower and mixed case, so I'm using regex. I can select what I'm looking for with this (note that I added #8619:

df[df["hostility"].str.contains(r"^Su(s|c)(p|])(i|e)", regex=True, case=False)]

        year    hostility                victim
878     2018    Suspicious Approach     Tug
7060    2001    SUSPICIOUS CRAFT        MERCHANT VESSEL
7068    2001    Suspicious group onboard a trawler      YACHT
7723    2000    SUSPICIOUS CRAFT        MERCHANT VESSEL
8619    2004    Protest                 tug 
10001   2003    SUSPICIOUS CRAFT        MERCHANT VESSEL

But I'm stuck at replacing all the variations so that they would like like this:

        year    hostility               victim
878     2018    Suspicious Approach     Tug
7060    2001    Suspicious Approach     MERCHANT VESSEL
7068    2001    Suspicious Approach     YACHT
7723    2000    Suspicious Approach     MERCHANT VESSEL
8619    2004    Protest                 tug 
10001   2003    Suspicious Approach     MERCHANT VESSEL

What is the most effective to do this?

CodePudding user response：

You can use a vectorized Series.str.replace method directly to replace the whole string that starts with the pattern of your choice. Note that it is not efficient to use groups with single character alternatives, regex offers you character classes for that. E.g. do not use (c|d), use [cd] instead which is much more efficient (see Why is a character class faster than alternation?).

So, you can use

df['hostility'] = df['hostility'].str.replace(r'(?i)^Su[sc][][p][ie].*', 'Suspicious Approach', regex=True)

Note that the regex is case insensitive due to the use of the (?i) inline modifier and regex=True makes the method treat the search argument as a regular expression.

Details:

(?i) - case insensitive modifier on
^ - start of string
Su - Su string
[sc] - s or c
[][p] - ], [ or p char (note you do not have to escape [ inside a character class, and ] if it is at the character class start position)
[ie] - i or e
.* - the rest of the line (if you need to match line breaks, replace (?i) with (?si) and . will match line breaks, too).