I'm cleaning up data for a personal project and am standardizing the large number of categories. The seemingly low hanging fruit have similar enough names such as:
'SUSPECIOUS CRAFT', 'SUSPECTED MILITANTS', 'SUSPECTED PIRATE','SUSPECTED TERRORISTS', 'SUSPICICIOUS APPROACH', 'SUSPICIOPUS APPROACH', 'SUSPICIOUS APPRAOCH','SUSPICIOUS APPROACH', 'SUSPICIOUS BOAT', 'SUSPICIOUS BOATS', 'SUSPICIOUS CRAFT', 'SUSPICIOUS CRAFTS', 'SUSPICIOUS VESSEL', 'SUSPICOUS APPROACH', 'SUSPICUIOUS APPROACH','SUSPIPICIOUS APPROACH', 'SUSPISIOUC CRAFT', 'SUS[ICIOUS APPROACH'
There are others, including ones with lower and mixed case, so I'm using regex. I can select what I'm looking for with this (note that I added #8619:
df[df["hostility"].str.contains(r"^Su(s|c)(p|])(i|e)", regex=True, case=False)]
year hostility victim
878 2018 Suspicious Approach Tug
7060 2001 SUSPICIOUS CRAFT MERCHANT VESSEL
7068 2001 Suspicious group onboard a trawler YACHT
7723 2000 SUSPICIOUS CRAFT MERCHANT VESSEL
8619 2004 Protest tug
10001 2003 SUSPICIOUS CRAFT MERCHANT VESSEL
But I'm stuck at replacing all the variations so that they would like like this:
year hostility victim
878 2018 Suspicious Approach Tug
7060 2001 Suspicious Approach MERCHANT VESSEL
7068 2001 Suspicious Approach YACHT
7723 2000 Suspicious Approach MERCHANT VESSEL
8619 2004 Protest tug
10001 2003 Suspicious Approach MERCHANT VESSEL
What is the most effective to do this?
CodePudding user response:
You can use a vectorized Series.str.replace
method directly to replace the whole string that starts with the pattern of your choice. Note that it is not efficient to use groups with single character alternatives, regex offers you character classes for that. E.g. do not use (c|d)
, use [cd]
instead which is much more efficient (see Why is a character class faster than alternation?).
So, you can use
df['hostility'] = df['hostility'].str.replace(r'(?i)^Su[sc][][p][ie].*', 'Suspicious Approach', regex=True)
Note that the regex is case insensitive due to the use of the (?i)
inline modifier and regex=True
makes the method treat the search argument as a regular expression.
Details:
(?i)
- case insensitive modifier on^
- start of stringSu
-Su
string[sc]
-s
orc
[][p]
-]
,[
orp
char (note you do not have to escape[
inside a character class, and]
if it is at the character class start position)[ie]
-i
ore
.*
- the rest of the line (if you need to match line breaks, replace(?i)
with(?si)
and.
will match line breaks, too).