I need to remove substrings from the values of a column in a dataframe.
Example: I have this column 'code' in a dataframe (in reality, the dataframe is very large):
3816R(motor) #I need '3816R'
97224(Eletro)
502812(Defletor)
97252(Defletor)
97525(Eletro)
5725 ( 56)
And I have this list to replace the values:
list = ['(motor)', '(Eletro)', '(Defletor)', '(Eletro)', '( 56)']
I've tried a lot of methods, like:
df['code'] = df['code'].str.replace(list, '')
I also tried regex=True, but none of these methods removed the substrings.
How can I do that?
CodePudding user response:
You can use a regex replace with alternation (the regex "or", written |): https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
l = ['(motor)', '(Eletro)', '(Defletor)', '( 56)']
l = [s.replace('(', r'\(').replace(')', r'\)') for s in l]  # escape the parentheses; raw strings avoid invalid-escape warnings
regex_str = f"({'|'.join(l)})"
df['code'] = df['code'].str.replace(regex_str, '', regex=True)
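The same idea as a self-contained sketch on the sample data from the question. It swaps the manual escaping for re.escape from the standard library, which also escapes the space, so the resulting pattern differs slightly from a hand-escaped one. The names to_remove and pattern, and the final strip(), are my own additions:

```python
import re
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({'code': ['3816R(motor)', '97224(Eletro)', '502812(Defletor)',
                            '97252(Defletor)', '97525(Eletro)', '5725 ( 56)']})

to_remove = ['(motor)', '(Eletro)', '(Defletor)', '( 56)']

# re.escape backslash-escapes regex metacharacters such as "(" and ")"
pattern = '|'.join(re.escape(s) for s in to_remove)

# One vectorized pass over the column; strip() drops the space left in "5725 "
df['code'] = df['code'].str.replace(pattern, '', regex=True).str.strip()
print(df['code'].tolist())
```

Without the strip(), the last row would keep its trailing space.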
The regex_str will end up looking something like
"(\(motor\)|\(Eletro\)|\(Defletor\)|\( 56\))"
CodePudding user response:
If you are certain that every row follows the format shown, you could try the following, using a lambda function:
df['code_clean'] = df['code'].apply(lambda x: x.split('(')[0])
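A minimal sketch of how this behaves on a few rows. The row '12345' (no parentheses) is a hypothetical addition to show the fallback, and the strip() is my own addition to handle the space in '5725 ( 56)':

```python
import pandas as pd

# Two rows from the question, plus one hypothetical row without parentheses
df = pd.DataFrame({'code': ['3816R(motor)', '5725 ( 56)', '12345']})

# Keep the text before the first "(" and strip leftover whitespace.
# split('(') on a string without "(" returns the whole string, so '12345' is unchanged.
df['code_clean'] = df['code'].apply(lambda x: x.split('(')[0].strip())
print(df['code_clean'].tolist())
```

Note that this assumes every value is a string. A missing value (NaN) is a float and has no split method, so it would raise an AttributeError.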
CodePudding user response:
You can try the regular expression match method: https://docs.python.org/3/library/re.html#re.Pattern.match
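import re
df['code'] = df['code'].apply(lambda x: re.match(r'^(\w+)\s*\(.*\)', x).group(1))
The first part of the regular expression, ^(\w+), creates a capturing group of the letters or numbers that come before a parenthesis. The rest, \s*\(.*\), matches optional whitespace and the parenthesized text, so rows like '5725 ( 56)' also match. group(1) then extracts the captured text. If a row does not match, re.match returns None and .group(1) raises an AttributeError.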
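The capturing-group idea can also be sketched with Series.str.extract, a vectorized pandas method. This is an alternative I am swapping in rather than the approach above, and it assumes the wanted code is always the leading run of word characters:

```python
import pandas as pd

# A few rows from the question
df = pd.DataFrame({'code': ['3816R(motor)', '97224(Eletro)', '5725 ( 56)']})

# ^(\w+) captures the leading run of letters, digits, and underscores.
# expand=False returns a Series instead of a one-column DataFrame.
df['code'] = df['code'].str.extract(r'^(\w+)', expand=False)
print(df['code'].tolist())
```

Rows that do not match come back as missing values instead of raising an error.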
CodePudding user response:
str.replace works with a single string, not a list of strings. You could loop through the list instead:
rmlist = ['(motor)', '(Eletro)', '(Defletor)', '(Eletro)', '( 56)']
for repl in rmlist:
    # regex=False treats "(" and ")" literally instead of as a regex group
    df['code'] = df['code'].str.replace(repl, '', regex=False)
Alternatively, if the bracketed substring is always at the end, split at "(" and discard the extra column that is generated. This should be faster than looping over the list:
df["code"] = df["code"].str.split(pat="(", n=1, expand=True)[0]
str.split is reasonably fast
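A rough sketch for checking the speed claim on your own data. The 20,000-row frame is a hypothetical size, the helper names loop_replace and split_first are my own, and the timings will vary by machine and pandas version, so no numbers are quoted here:

```python
import timeit
import pandas as pd

codes = ['3816R(motor)', '97224(Eletro)', '502812(Defletor)', '5725 ( 56)']
df = pd.DataFrame({'code': codes * 5000})  # 20,000 rows, a hypothetical size
rmlist = ['(motor)', '(Eletro)', '(Defletor)', '( 56)']

def loop_replace(s):
    # One literal (non-regex) pass over the column per substring
    for repl in rmlist:
        s = s.str.replace(repl, '', regex=False)
    return s.str.strip()

def split_first(s):
    # Single pass: keep everything before the first "("
    return s.str.split(pat='(', n=1, expand=True)[0].str.strip()

# Both approaches should produce the same cleaned values
assert loop_replace(df['code']).tolist() == split_first(df['code']).tolist()

print('loop :', timeit.timeit(lambda: loop_replace(df['code']), number=5))
print('split:', timeit.timeit(lambda: split_first(df['code']), number=5))
```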