Match parenthesis at the end of a string

Time:05-12

I have a dataframe that has one column of strings. I need to clean this column and remove any text within parentheses from each string. For example:

Names
Mike (5 friends)
Tom
Joe (2 friends)
Alex

I want df to look like this after:

Names
Mike
Tom
Joe
Alex

Currently my code looks like this

import re

for i in df["Names"]:
    if i contains r"\([^()]*\)"
        i = re.sub(r"\([^()]*\)", "", i)

But I am getting a syntax error on the if statement line. What do I need to set my if statement's "contains" condition to in order to make this work, while staying dynamic for the number inside the parentheses? Thanks.

I used the following code on an isolated string and it worked as I wanted. I'm having trouble understanding why this same line wouldn't also work as my "contains" condition:

re.sub(r"\([^()]*\)", ""

CodePudding user response:

Use str.replace:

df["Names"] = df["Names"].str.replace(r'\s*\(.*?\)$', '', regex=True)

Here is a regex demo showing that the above replacement logic is working.
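
On the syntax error itself: Python has no `contains` keyword, so `if i contains ...` cannot be parsed. The closest equivalent is a membership test with `re.search`. Below is a minimal sketch of that idea; the list `names` and the helper `strip_parens` are hypothetical stand-ins for the column values, not part of the original code. The check is optional, because `re.sub` returns the string unchanged when nothing matches. Also, reassigning the loop variable never writes back to the dataframe, which is why the results are collected into a new list here.

```python
import re

# Same pattern as in the question: "(" + anything except parentheses + ")"
PAREN = re.compile(r"\([^()]*\)")

def strip_parens(text):
    # re.search plays the role of the "contains" test
    if PAREN.search(text):
        text = PAREN.sub("", text)
    # Remove the space left behind before the parenthesis
    return text.strip()

# Hypothetical sample data standing in for df["Names"]
names = ["Mike (5 friends)", "Tom", "Joe (2 friends)", "Alex"]
cleaned = [strip_parens(n) for n in names]
print(cleaned)  # ['Mike', 'Tom', 'Joe', 'Alex']
```

With a real dataframe, the vectorized `str.replace` call is usually simpler than an explicit loop.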

CodePudding user response:


df['Names'] = df['Names'].str.replace(r"\(.[^\)]+.", '', regex=True).str.strip()

OR

import re
lst = ["Mike (5 friends)",
       "Tom",
       "Joe (2 friends)",
       "Alex"]

new_lst = [re.sub(r'\(.[^\)]+.', '', a).strip() for a in lst]
print(new_lst)

OUTPUT

['Mike', 'Tom', 'Joe', 'Alex']

CodePudding user response:

Try this. It goes through the string character by character and trims it at the last opening parenthesis found.

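def CustomParser(word):
    # Index of the last opening parenthesis, or -1 if there is none
    trim_position = -1
    for j in range(len(word)):
        letter = word[j:j + 1]
        if letter == "(":
            trim_position = j

    # No parenthesis found: return the word as is (minus surrounding spaces)
    if trim_position == -1:
        return word.strip()
    return word[:trim_position].strip()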
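
The same idea, cutting the string at the last opening parenthesis, can also be sketched with the built-in `str.rfind` instead of a manual loop. The name `trim_at_last_paren` is hypothetical. The "not found" case has to be handled explicitly, because `rfind` returns -1 and slicing with -1 would drop the last character.

```python
def trim_at_last_paren(word):
    # str.rfind returns the index of the last "(", or -1 if there is none
    pos = word.rfind("(")
    if pos == -1:
        return word.strip()
    # Keep everything before the parenthesis and drop the trailing space
    return word[:pos].strip()

# Hypothetical sample data matching the question
for name in ["Mike (5 friends)", "Tom", "Joe (2 friends)", "Alex"]:
    print(trim_at_last_paren(name))
```

Unlike the regex answers, this removes everything from the parenthesis onward, even if the closing parenthesis is missing or more text follows it.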

CodePudding user response:

You need to use

df["Names"] = df["Names"].str.replace(r'\s*\([^()]*\)$', '', regex=True)

If there can be trailing whitespace:

df["Names"] = df["Names"].str.replace(r'\s*\([^()]*\)\s*$', '', regex=True)

Details:

  • \s* - zero or more whitespaces
  • \( - a ( char
  • [^()]* - zero or more chars other than ( and )
  • \) - a ) char
  • $ - end of string.
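
To see what the `$` anchor buys, here is a small sketch that applies the trailing-whitespace variant of the pattern with the standard `re` module. The sample strings and the names `END_PAREN`, `samples`, and `results` are made up for illustration.

```python
import re

# Parenthesized group at the very end, with optional whitespace around it
END_PAREN = re.compile(r"\s*\([^()]*\)\s*$")

# Hypothetical inputs: the last one has its parentheses in the middle
samples = ["Mike (5 friends)", "Tom", "Joe (2 friends)  ", "Ann (a) Lee"]
results = [END_PAREN.sub("", s) for s in samples]
print(results)  # ['Mike', 'Tom', 'Joe', 'Ann (a) Lee']
```

Only a group at the end of the string is removed, so "Ann (a) Lee" is left alone.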

NOTE on regex=True:

According to the pandas 1.2.0 release notes (the default did change to False in pandas 2.0, so pass regex=True explicitly):

The default value of regex for Series.str.replace() will change from True to False in a future release. In addition, single character regular expressions will not be treated as literal strings when regex=True is set (GH24804).
