Replace string if it meets any condition values in python

Time:09-29

I have a pandas DataFrame and would like to delete a sub-string from every value that contains it, where the sub-string is one of 'test', 'tests', 'testing', 'orig' or 'new'. I can use str.replace() to remove a single sub-string, but I am unsure how to extend this to test for and remove several.

See below:

import pandas as pd

df_1 = pd.DataFrame({'id': ['001', '002', '003', '004', '005', '006', '007', '008'],
                     'color_value': ['blue_test', 'red', 'yellow_tests', 'orange_orig',
                                     'blue_new', 'red', 'blue_testing', 'orange']})

For one condition, I can do:

term = 'test'
df_1['color_value'] = df_1['color_value'].str.replace(term,'')

How can I extend it to include removing 'tests', 'testing', 'orig' and 'new'?

Answer:

Use a regular expression:

term = 'test(s|ing)?'
df_1['color_value'] = df_1['color_value'].str.replace(term, '', regex=True)
print(df_1)

Output

    id color_value
0  001       blue_
1  002         red
2  003     yellow_
3  004 orange_orig
4  005    blue_new
5  006         red
6  007       blue_
7  008      orange

From the documentation on str.replace:

pat : str or compiled regex
    String can be a character sequence or regular expression.
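
To make the difference between the two interpretations concrete, here is a minimal sketch (the Series `s` is made up for illustration). The default value of `regex` has differed between pandas versions, so passing it explicitly is the safer choice:

```python
import pandas as pd

# Illustrative data, not taken from the question's DataFrame
s = pd.Series(['blue_test', 'yellow_tests', 'blue_testing', 'red'])
pattern = 'test(s|ing)?'

# regex=True: the pattern is interpreted as a regular expression
as_regex = s.str.replace(pattern, '', regex=True)

# regex=False: the pattern is searched for as literal text,
# so the parentheses and '?' never match and nothing changes
as_literal = s.str.replace(pattern, '', regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

With `regex=False` the values come back unchanged, which is an easy way to spot that a pattern is being treated as plain text.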

UPDATE

To also remove "new" and "orig", extend the regex with alternation:

term = 'test(s|ing)?|new|orig'
df_1['color_value'] = df_1['color_value'].str.replace(term, '', regex=True)
print(df_1)

Output

    id color_value
0  001       blue_
1  002         red
2  003     yellow_
3  004     orange_
4  005       blue_
5  006         red
6  007       blue_
7  008      orange
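
Note that the output above keeps the trailing underscore. If that is not wanted, one possible variation is to include the underscore in the pattern and anchor it at the end of the value. This is only a sketch under the assumption that the unwanted term is always the final part of the value and is joined with a single underscore; the question does not state this:

```python
import pandas as pd

df_1 = pd.DataFrame({'id': ['001', '002', '003', '004', '005', '006', '007', '008'],
                     'color_value': ['blue_test', 'red', 'yellow_tests', 'orange_orig',
                                     'blue_new', 'red', 'blue_testing', 'orange']})

# Assumption: the term is a suffix preceded by '_'.
# '$' anchors the match at the end of the value, so a word that merely
# contains 'new' or 'test' somewhere in the middle is left alone.
term = r'_(?:test(?:s|ing)?|new|orig)$'
df_1['color_value'] = df_1['color_value'].str.replace(term, '', regex=True)
print(df_1['color_value'].tolist())
```

Anchoring also reduces the risk of accidental matches inside longer words, at the cost of missing terms that appear anywhere other than the end.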

General Solution

If you have many words, I suggest using a library such as trrex, which builds a regular expression from a set of words:

import pandas as pd
import trrex as tx

df_1 = pd.DataFrame({'id': ['001', '002', '003', '004', '005', '006', '007', '008'],
                     'color_value': ['blue_test', 'red', 'yellow_tests', 'orange_orig',
                                     'blue_new', 'red', 'blue_testing', 'orange']})

term = tx.make(['test', 'tests', 'testing', 'orig', 'new'], prefix="", suffix="")
df_1['color_value'] = df_1['color_value'].str.replace(term, '', regex=True)
print(df_1)

Output

    id color_value
0  001       blue_
1  002         red
2  003     yellow_
3  004     orange_
4  005       blue_
5  006         red
6  007       blue_
7  008      orange

The pattern for the given example is:

term = tx.make(['test', 'tests', 'testing', 'orig', 'new'], prefix="", suffix="")
print(term)

Output (pattern built by trrex)

(?:test(?:ing|s)?|new|orig)
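
If adding a dependency is not an option, a simpler (less compact) pattern can be assembled with the standard library alone. The sketch below is not how trrex works internally; it just joins the escaped words with `|`, longest first so that 'testing' is tried before 'test'. The names `words`, `alternatives` and `values` are illustrative:

```python
import re

words = ['test', 'tests', 'testing', 'orig', 'new']

# Sort longest first: regex alternation takes the first alternative that
# matches, so 'test' listed before 'testing' would leave 'ing' behind.
alternatives = sorted(words, key=len, reverse=True)

# re.escape guards against words that contain regex metacharacters
pattern = '(?:' + '|'.join(re.escape(w) for w in alternatives) + ')'
print(pattern)

values = ['blue_test', 'red', 'yellow_tests', 'orange_orig',
          'blue_new', 'red', 'blue_testing', 'orange']
cleaned = [re.sub(pattern, '', v) for v in values]
print(cleaned)
```

The resulting string can be passed to str.replace with regex=True in the same way as the patterns above.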

DISCLAIMER

I'm the author of trrex
