Remove suffix if string matches regular expression in pandas

Time:11-24

In one of my columns, I would like to remove a suffix from certain strings where it matches a pattern. This is a snippet of my column data:

col1
DS-71007-002-1
DN-80013-002-6
2"-VH-11008-MW01-03

This is what I would like:

col1
DS-71007-002
DN-80013-002
2"-VH-11008-MW01-03

I basically just want to remove the -1 and -6 from the first two lines.

I'm not very good with regular expressions. I just came up with this:

[A-Z][A-Z]-[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9]

But I'm unsure how to then remove the suffix from the strings that match it.

CodePudding user response:

Using a regex with a lookbehind:

df['col2'] = df['col1'].str.replace(r'(?<=^\w{2}-\d{5}-\d{3})(-\d+)', '', regex=True)

output:

                  col1                 col2
0       DS-71007-002-1         DS-71007-002
1       DN-80013-002-6         DN-80013-002
2  2"-VH-11008-MW01-03  2"-VH-11008-MW01-03
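
To reproduce this end to end, here is a minimal sketch. The DataFrame construction is an assumption (the question only shows the column values), and the trailing digits are matched with `\d+`:

```python
import pandas as pd

# Sample data copied from the question; building the frame this way is an assumption.
df = pd.DataFrame({"col1": ["DS-71007-002-1", "DN-80013-002-6", '2"-VH-11008-MW01-03']})

# Remove "-<digits>" only when it directly follows a prefix shaped like
# XX-12345-123 that starts at the beginning of the string.
df["col2"] = df["col1"].str.replace(r"(?<=^\w{2}-\d{5}-\d{3})-\d+", "", regex=True)

print(df)
```

The third row is left alone because its first two characters, `2"`, are not both word characters, so the lookbehind never matches.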

In summary, the regex breaks down as follows:

(?<=   # lookbehind
^      # match beginning of string
\w{2}  # 2 word characters
-      # a dash
\d{5}  # 5 digits
-      # a dash
\d{3}  # 3 digits
)      # end of lookbehind
(      # capturing group (what will be replaced by the empty string)
-\d+   # a dash followed by one or more digits
)      # end of capturing group
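
A possible variant, offered as a sketch rather than as part of the answer above: capture the prefix in a group, anchor the pattern at both ends, and replace the whole match with that group. It uses the stricter `[A-Z]{2}` prefix from the question's own pattern. The names `s`, `pattern`, and `result` are illustrative.

```python
import pandas as pd

# Sample values copied from the question.
s = pd.Series(["DS-71007-002-1", "DN-80013-002-6", '2"-VH-11008-MW01-03'])

# Group 1 is the part to keep; "-\d+$" is the suffix to drop.
# The trailing "$" means nothing may follow the suffix.
pattern = r"^([A-Z]{2}-\d{5}-\d{3})-\d+$"

# Strings that do not match the full pattern are returned unchanged.
result = s.str.replace(pattern, r"\1", regex=True)

print(result.tolist())
```

Because of the `$` anchor, this version only changes strings that consist of exactly the prefix plus one numeric suffix, whereas the lookbehind version would also strip the suffix when more text follows it.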

CodePudding user response:

You can use numpy.where with Series.str.split; no regex is needed:

In [915]: import numpy as np

In [916]: df['col1'] = np.where(df.col1.str.split('-').str[-2].str.isnumeric(), df.col1.str.split('-').str[:-1].str.join('-'), df.col1)

In [917]: df
Out[917]: 
                  col1
0         DS-71007-002
1         DN-80013-002
2  2"-VH-11008-MW01-03
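
The same idea as a self-contained script, as a minimal sketch: the split is computed once and the result is written to a new column `col2` so the original stays available for comparison. The DataFrame construction and the `parts` name are assumptions, and every value is assumed to contain at least one dash.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"col1": ["DS-71007-002-1", "DN-80013-002-6", '2"-VH-11008-MW01-03']})

# Split each string on "-" once and reuse the resulting lists.
parts = df["col1"].str.split("-")

# If the second-to-last segment is numeric, drop the last segment and
# re-join the rest; otherwise keep the original string.
df["col2"] = np.where(
    parts.str[-2].str.isnumeric(),
    parts.str[:-1].str.join("-"),
    df["col1"],
)

print(df)
```

Note that this check is looser than the regex: it drops the last segment of any string whose second-to-last segment is numeric, so a value such as `AB-123-4` would also be shortened to `AB-123`.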