pandas - Remove a particular character as well as the previous and subsequent characters

I have translated Bengali phonetics into English. But after parsing, I got some trash characters, which I want to remove. My data frame looks like this.

col1        
utto্tor        
dokkho্shin     
muuns্si

I want to remove each trash character together with the characters immediately before and after it. For example, in the first row I want to remove ্ and also the o and t on either side of it.

My desired output looks like this:

col1            col2
utto্tor        uttor
dokkho্shin     dokkhhin
muuns্si        muuni

P.S. I got these characters from the Avro parser, which I use like this:

reversed_text = avro.reverse("উত্তর")
print(reversed_text)

output: utto্tor

col0        col1
উত্তর       utto্tor
দক্ষিণ      dokkho্shin
মুন্সী         muuns্si

CodePudding user response：

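You can use str.replace with a regular expression that matches any non-ASCII character together with the one character before it and the one character after it, and replace each match with an empty string: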

df['col2'] = df['col1'].str.replace(r'.[^\x00-\x7F].', '', regex=True)

output:

         col1      col2
0     utto্tor     uttor
1  dokkho্shin  dokkhhin
2     muuns্si     muuni
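
To see what the pattern does on its own, here is a minimal sketch using Python's built-in re module instead of pandas. It assumes the stray character is the Bengali hasanta (U+09CD), written as an escape so it is visible in the code. The LOOSE variant is not part of the answer above; it is a hypothetical alternative for the case where the stray character sits at the very start or end of a string and has only one neighbour.

```python
import re

# Any one character, then one non-ASCII character, then any one character.
PATTERN = re.compile(r'.[^\x00-\x7F].')

# Sample values from the question; '\u09cd' is assumed to be the stray character.
samples = ['utto\u09cdtor', 'dokkho\u09cdshin', 'muuns\u09cdsi']
cleaned = [PATTERN.sub('', s) for s in samples]
print(cleaned)  # ['uttor', 'dokkhhin', 'muuni']

# Edge case: a stray character at the start has no preceding character,
# so the leading '.' cannot match and the string is left unchanged.
edge = '\u09cdabc'
print(PATTERN.sub('', edge) == edge)  # True

# Hypothetical variant: make both neighbours optional.
LOOSE = re.compile(r'.?[^\x00-\x7F].?')
print(LOOSE.sub('', edge))  # bc
```

Note that the original pattern removes every non-ASCII character with its neighbours, so it would also strip legitimate accented or Bengali letters if the column contained any.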

CodePudding user response：

The pandas str accessor should provide the functionality you need: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html

Example:

import pandas as pd

df = pd.DataFrame({'Col1': ['Text1', 'Text2']})
df['Col1'] = df['Col1'].str.replace("Text", "newText")
df

str.replace also accepts regular expressions when you pass regex=True.
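
Applied to the data in the question, a regular expression with the str accessor could look like the sketch below. It assumes the stray character is always the Bengali hasanta (U+09CD) and targets only that character rather than every non-ASCII one; the column names follow the question.

```python
import pandas as pd

# Sample data from the question; '\u09cd' is assumed to be the stray character.
df = pd.DataFrame({'col1': ['utto\u09cdtor', 'dokkho\u09cdshin', 'muuns\u09cdsi']})

# Remove the hasanta plus one character on each side of it.
df['col2'] = df['col1'].str.replace(r'.\u09cd.', '', regex=True)

print(df['col2'].tolist())  # ['uttor', 'dokkhhin', 'muuni']
```

Matching the one known character is narrower than matching all non-ASCII text, which may be safer if the column can hold other non-ASCII characters that should be kept.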