Home > front end >  Regex: if string ends with numerical values - drop it, OR if string ends with single letter - drop i
Regex: if string ends with numerical values - drop it, OR if string ends with single letter - drop i

Time:04-23

Here is a sample of a dataset I have:

ID Project
1 21st Townhouse 318
2 The Residences 6
3 Villanova Tower B
4 The Hills H
5 City Park

I need to transform 'Project' column so that:

  1. if a row ends with numeric values, they should be dropped
  2. if a row ends with single letter, it should be dropped
  3. else, leave as it is

Here is how I want it to look like:

ID Project
1 21st Townhouse
2 The Residences
3 Villanova Tower
4 The Hills
5 City Park

I tried to search for some solution, and found this(for first condition with numeric values only):

df['Project']=df.Project[~((df.Project.astype(str).str.match("(.*\d)")) & (df.Project.astype(str).str.len() > 1))] 

It worked, however, I tried to apply it for the second condition as well:

df['Project']=df.Project[~((df.Project.astype(str).str.match("(.*\w)")) & (df.Project.astype(str).str.len() == 1))] 

But, It failed

Can you help me, please? Thank you!

CodePudding user response:

You can use

df['Project']=df['Project'].str.replace(r'\s (?:\d |[A-Za-z])$', '', regex=True)

See the regex demo.

Details:

  • \s - one or more whitespaces
  • (?:\d |[A-Za-z]) - a non-capturing group matching either
    • \d - one or more digits
    • | - or
    • [A-Za-z] - an ASCII letter (use [^\W\d_] to match any Unicode letter)
  • $ - end of string.

If your strings have no whitespace before the last number/letter, you may use

df['Project'].str.replace(r'\b(?:\d |[A-Za-z])$', '', regex=True).str.rstrip()

The \b matches a word boundary making sure there is no letter, digit or _ right before the final digits/letter, and .str.rstrip() will strip the trailing whitespace.

  • Related