I have a pandas DataFrame and would like to remove a substring from every value when that substring is 'test', 'tests', 'testing', 'orig' or 'new'. I can use str.replace() to remove a single substring, but I am unsure how to extend it to match and remove several different substrings.
See below:
import pandas as pd
df_1 = pd.DataFrame({'id': ['001', '002', '003', '004', '005', '006', '007', '008'],
                     'color_value': ['blue_test', 'red', 'yellow_tests', 'orange_orig',
                                     'blue_new', 'red', 'blue_testing', 'orange']})
For one condition, I can do:
term = 'test'
df_1['color_value'] = df_1['color_value'].str.replace(term,'')
How can I extend it to include removing 'tests', 'testing', 'orig' and 'new'?
CodePudding user response:
Use a regular expression:
term = 'test(s|ing)?'
df_1['color_value'] = df_1['color_value'].str.replace(term, '', regex=True)
print(df_1)
Output
id color_value
0 001 blue_
1 002 red
2 003 yellow_
3 004 orange_orig
4 005 blue_new
5 006 red
6 007 blue_
7 008 orange
From the documentation on str.replace:
pat : str or compiled regex
String can be a character sequence or regular expression.
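Since pat may also be a compiled regex, flags such as case-insensitivity can be attached to the pattern itself. The following is a minimal sketch, assuming a reasonably recent pandas version; the upper-case sample value is made up for illustration and is not part of the question's data:

```python
import re

import pandas as pd

# Small illustrative frame; 'yellow_TESTS' is a hypothetical value added
# to show the effect of the IGNORECASE flag.
df_demo = pd.DataFrame({
    'id': ['001', '002', '003'],
    'color_value': ['blue_test', 'yellow_TESTS', 'red'],
})

# Compile once. With a compiled pattern, flags belong on the pattern;
# the `case` and `flags` arguments of str.replace must be left unset.
pattern = re.compile(r'test(?:s|ing)?', flags=re.IGNORECASE)

cleaned = df_demo['color_value'].str.replace(pattern, '', regex=True)
print(cleaned.tolist())
```

The non-capturing group (?:...) behaves like (s|ing) here; it only avoids creating a capture group that is never used.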
UPDATE
To also remove "new" and "orig",
you could extend the regex with more alternatives:
term = 'test(s|ing)?|new|orig'
df_1['color_value'] = df_1['color_value'].str.replace(term, '', regex=True)
print(df_1)
Output
id color_value
0 001 blue_
1 002 red
2 003 yellow_
3 004 orange_
4 005 blue_
5 006 red
6 007 blue_
7 008 orange
General Solution
If you have many words, I suggest using a library such as trrex, which builds a regular expression from a set of words:
import pandas as pd
import trrex as tx
df_1 = pd.DataFrame({'id': ['001', '002', '003', '004', '005', '006', '007', '008'],
                     'color_value': ['blue_test', 'red', 'yellow_tests', 'orange_orig',
                                     'blue_new', 'red', 'blue_testing', 'orange']})
term = tx.make(['test', 'tests', 'testing', 'orig', 'new'], prefix="", suffix="")
df_1['color_value'] = df_1['color_value'].str.replace(term, '', regex=True)
print(df_1)
Output
id color_value
0 001 blue_
1 002 red
2 003 yellow_
3 004 orange_
4 005 blue_
5 006 red
6 007 blue_
7 008 orange
The pattern for the given example is:
term = tx.make(['test', 'tests', 'testing', 'orig', 'new'], prefix="", suffix="")
print(term)
Output (pattern built by trrex)
(?:test(?:ing|s)?|new|orig)
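If adding a dependency is not an option, a comparable (though less compact) pattern can be assembled with the standard library alone. This is only a sketch: build_pattern is a hypothetical helper name, not part of pandas or trrex. It escapes each word and puts longer words first, because regex alternation tries alternatives left to right and 'test' listed before 'tests' would leave a stray 's' behind:

```python
import re

import pandas as pd


def build_pattern(words):
    """Join words into one alternation (hypothetical helper).

    Words are escaped so regex metacharacters are matched literally,
    and sorted longest first (ties broken alphabetically) so that a
    longer word is tried before any shorter word it starts with.
    """
    ordered = sorted(set(words), key=lambda w: (-len(w), w))
    return '(?:' + '|'.join(re.escape(w) for w in ordered) + ')'


words = ['test', 'tests', 'testing', 'orig', 'new']
term = build_pattern(words)
print(term)

df_1 = pd.DataFrame({'id': ['001', '002', '003', '004', '005', '006', '007', '008'],
                     'color_value': ['blue_test', 'red', 'yellow_tests', 'orange_orig',
                                     'blue_new', 'red', 'blue_testing', 'orange']})
df_1['color_value'] = df_1['color_value'].str.replace(term, '', regex=True)
print(df_1['color_value'].tolist())
```

For a handful of words this is usually sufficient; a trie-based pattern like the one trrex produces mainly pays off when the word list is large.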
DISCLAIMER
I'm the author of trrex