In one of my columns, I would like to remove a suffix from certain strings where it matches a pattern. This is a snippet of my column data:
col1
DS-71007-002-1
DN-80013-002-6
2"-VH-11008-MW01-03
This is what I would like:
col1
DS-71007-002
DN-80013-002
2"-VH-11008-MW01-03
I basically just want to remove the -1 and -6 from the first two lines.
I'm not very good with regular expressions. I just came up with this:
[A-Z][A-Z]-[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9]
But I'm unsure how to go about removing the suffixes from the strings that match this.
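For reference, here is a minimal way to reproduce the column in pandas (this assumes the data lives in a DataFrame named df with a single column col1, as in the answers below):

import pandas as pd

# hypothetical reproduction of the sample column shown above
df = pd.DataFrame({'col1': ['DS-71007-002-1',
                            'DN-80013-002-6',
                            '2"-VH-11008-MW01-03']})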
CodePudding user response:
Using a regex with a lookbehind:
df['col2'] = df['col1'].str.replace(r'(?<=^\w{2}-\d{5}-\d{3})(-\d+)', '', regex=True)
output:
col1 col2
0 DS-71007-002-1 DS-71007-002
1 DN-80013-002-6 DN-80013-002
2 2"-VH-11008-MW01-03 2"-VH-11008-MW01-03
You can test the regex here; in summary, it is:
(?<=      # lookbehind
  ^       # start of string
  \w{2}   # 2 word characters
  -       # a dash
  \d{5}   # 5 digits
  -       # a dash
  \d{3}   # 3 digits
)         # end of lookbehind
(         # capturing group (what will be replaced by the empty string)
  -\d+    # a dash and one or more digits
)         # end of capturing group
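If you want to sanity-check the pattern outside pandas, a minimal sketch using Python's built-in re module on the sample strings from the question behaves the same way (the names pattern and samples are just for illustration):

import re

pattern = r'(?<=^\w{2}-\d{5}-\d{3})(-\d+)'
samples = ['DS-71007-002-1', 'DN-80013-002-6', '2"-VH-11008-MW01-03']

for s in samples:
    # the suffix is removed only when the lookbehind matches at the start of the string
    print(re.sub(pattern, '', s))
# DS-71007-002
# DN-80013-002
# 2"-VH-11008-MW01-03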
CodePudding user response:
You can use numpy.where with Series.str.split. No need for regex:
In [915]: import numpy as np
In [916]: df['col1'] = np.where(df.col1.str.split('-').str[-2].str.isnumeric(), df.col1.str.split('-').str[:-1].str.join('-'), df.col1)
In [917]: df
Out[917]:
col1
0 DS-71007-002
1 DN-80013-002
2 2"-VH-11008-MW01-03