How can I remove a list of substrings from the strings in a dataframe column?

I need to remove substrings from the values of a dataframe column.

Example: I have this column 'code' in a dataframe (in reality, the dataframe is very large):

3816R(motor) #I need '3816R'
97224(Eletro)
502812(Defletor)
97252(Defletor)
97525(Eletro)
5725 ( 56)

And I have this list of substrings to remove from the values:

list = ['(motor)', '(Eletro)', '(Defletor)', '(Eletro)', '( 56)']

I've tried a lot of methods, like:

df['code'] = df['code'].str.replace(list, '')

I also tried regex=True, but none of these methods removed the substrings.

How can I do that?

CodePudding user response：

You can try a regex replace combined with the regex OR operator (|): https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/

l = ['(motor)', '(Eletro)', '(Defletor)', '( 56)']
l = [s.replace('(', r'\(').replace(')', r'\)') for s in l]  # raw strings: escape the parentheses so they match literally
regex_str = f"({'|'.join(l)})"
df['code'] = df['code'].str.replace(regex_str, '', regex=True)

The regex_str will end up with something like

"(\(motor\)|\(Eletro\)|\(Defletor\)|\( 56\))"

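As a variation on the snippet above, `re.escape` from the standard library can do the escaping instead of chained `replace` calls. This is a minimal sketch, assuming the sample data from the question; the name `to_remove` and the trailing `.str.strip()` are additions for illustration, not part of the original answer:

```python
import re

import pandas as pd

df = pd.DataFrame({"code": ["3816R(motor)", "97224(Eletro)", "502812(Defletor)",
                            "97252(Defletor)", "97525(Eletro)", "5725 ( 56)"]})

# Substrings to remove (hypothetical name; the question calls it `list`).
to_remove = ["(motor)", "(Eletro)", "(Defletor)", "( 56)"]

# re.escape makes every regex metacharacter literal; "|" joins the alternatives.
pattern = "|".join(re.escape(s) for s in to_remove)

# Remove every listed substring, then trim the space left behind in "5725 ".
df["code"] = df["code"].str.replace(pattern, "", regex=True).str.strip()
```

`re.escape` also covers other metacharacters such as `.` or `+`, should the list ever contain them.
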
CodePudding user response：

If you are certain any and all rows follow the format provided, you could attempt the following by using a lambda function:

df['code_clean'] = df['code'].apply(lambda x: x.split('(')[0])

CodePudding user response：

You can try the regular expression match method: https://docs.python.org/3/library/re.html#re.Pattern.match

import re
df['code'] = df['code'].apply(lambda x: re.match(r'^(\w+)\(\w+\)', x).group(1))

The first part of the regular expression, ^(\w+), creates a capturing group of one or more letters, digits, or underscores before the opening parenthesis. group(1) then extracts that text.
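
One caveat: `re.match` returns `None` when a value does not fit the pattern, and a row such as `5725 ( 56)`, where a space precedes the parenthesis, does not fit the pattern above, so calling `.group(1)` on it raises an `AttributeError`. A more tolerant sketch follows; the helper name `extract_code`, the looser pattern, and the extra sample `(no code)` are assumptions for illustration:

```python
import re

# Leading run of letters, digits, or underscores, optionally preceded by whitespace.
CODE_PATTERN = re.compile(r"^\s*(\w+)")


def extract_code(value):
    """Return the leading code, or the value unchanged if nothing matches."""
    match = CODE_PATTERN.match(value)
    return match.group(1) if match else value


samples = ["3816R(motor)", "97224(Eletro)", "5725 ( 56)", "(no code)"]
cleaned = [extract_code(s) for s in samples]
```

With a dataframe, the helper could be applied as `df['code'].apply(extract_code)`.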

CodePudding user response：

str.replace works with a single string, not a list of strings. You could loop through the list instead:

rmlist = ['(motor)', '(Eletro)', '(Defletor)', '(Eletro)', '( 56)']
for repl in rmlist:
    df['code'] = df['code'].str.replace(repl, '', regex=False)  # regex=False: treat the parentheses literally

Alternatively, if the bracketed substring is always at the end, split at "(" and discard the additional column that is generated. This should be faster:

df["code"] = df["code"].str.split(pat="(", n=1, expand=True)[0]

str.split is reasonably fast
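
To see how the split approach behaves on edge cases, here is a small sketch with made-up sample rows (the row `12345` without parentheses and the column name `code_clean` are assumptions for illustration). It uses `.str[0]` rather than `expand=True`, and adds `.str.strip()` because `5725 ( 56)` would otherwise keep its trailing space:

```python
import pandas as pd

df = pd.DataFrame({"code": ["3816R(motor)", "5725 ( 56)", "12345"]})

# Split once at the first "(", keep the part before it, then trim whitespace.
# A value without "(" yields a one-element list, so it passes through unchanged.
df["code_clean"] = df["code"].str.split("(", n=1).str[0].str.strip()
```

Writing to a separate column keeps the original values available for checking before overwriting `code`.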