Remove specific combination of characters in dataframe column?

I have some data containing a specific combination of characters that I need to remove. For example:

data_col
*.test1.934n
test1.tedsdh
*.test1.test.sdfsdf
jhsdakn
*.test2.test

I need to remove every instance of the "*." character combination from the dataframe column. So far I've tried:

df['data_col'].str.replace('^*.','')

However when I run the code it gives me this error:

re.error: nothing to repeat at position 1

Any advice on how to fix this? Thanks in advance.

CodePudding user response：

In pandas versions before 2.0, .str.replace treats the search pattern as a regular expression by default. In a regular expression, * is a quantifier that repeats the preceding element; in '^*.' it follows only the ^ anchor, so there is nothing to repeat, which is the error you see. To match characters with special meaning such as * and . literally, you have to escape them with backslashes:

df['data_col'].str.replace(r'^\*\.', '', regex=True)

Note that I used a raw string literal so that the backslashes reach the regex engine unchanged. I also passed regex=True explicitly, because otherwise pandas emits a FutureWarning saying that the default will change and patterns will no longer be treated as regular expressions. Because of the ^ at the start, this regex only matches at the beginning of each string.
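To see the escaping on its own, here is a minimal sketch that uses only Python's re module (the engine pandas relies on for regex patterns) and the sample values from the question. The names samples, compile_error, anchored and cleaned are just illustrative.

```python
import re

# Sample values copied from the question.
samples = [
    "*.test1.934n",
    "test1.tedsdh",
    "*.test1.test.sdfsdf",
    "jhsdakn",
    "*.test2.test",
]

# The unescaped pattern does not compile: after the ^ anchor,
# the * quantifier has nothing to repeat.
try:
    re.compile("^*.")
    compile_error = None
except re.error as exc:
    compile_error = str(exc)

# re.escape adds the backslashes for us, so "*." becomes \*\.
anchored = re.compile("^" + re.escape("*."))

# Strip the literal "*." prefix where it is present.
cleaned = [anchored.sub("", s) for s in samples]

print(compile_error)
print(cleaned)
```

Using re.escape is optional here; it is mainly handy when the text to match comes from a variable and you don't want to escape it by hand.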

However, it is also possible that you don't need regular expressions in this particular case at all.

If you want to remove every instance of *. in your strings (not only those at the beginning), you can match it as a literal string with

df['data_col'].str.replace('*.', '', regex=False)

If you want to remove *. only at the beginning of each string, you can use .str.removeprefix (available since pandas 1.4) instead:

df['data_col'].str.removeprefix('*.')
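As a rough illustration of the difference between the two non-regex options, here is a small sketch. It assumes pandas 1.4 or later (for .str.removeprefix), rebuilds the sample column from the question, and adds one made-up value, "*.a*.b", that is not in the original data, only to show what happens when *. also appears in the middle of a string.

```python
import pandas as pd

# Sample column from the question.
df = pd.DataFrame({
    "data_col": [
        "*.test1.934n",
        "test1.tedsdh",
        "*.test1.test.sdfsdf",
        "jhsdakn",
        "*.test2.test",
    ]
})

# Hypothetical value (not from the question) with "*." twice.
extra = pd.Series(["*.a*.b"])

# Literal replace removes every occurrence of "*.".
everywhere = extra.str.replace("*.", "", regex=False)

# removeprefix only strips a leading "*.".
prefix_only = extra.str.removeprefix("*.")

# Neither method modifies the column in place, so assign the result back.
df["data_col"] = df["data_col"].str.removeprefix("*.")

print(everywhere.tolist())
print(prefix_only.tolist())
print(df["data_col"].tolist())
```

For the data shown in the question both approaches give the same result, since *. only ever appears at the start. Either way, remember to assign the result back to the column, because these string methods return a new Series.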