I have transliterated Bengali text into English (Latin) phonetics. After parsing, I got some stray characters that I want to remove. My data frame looks like this:
col1
utto্tor
dokkho্shin
muuns্si
I want to remove each stray character along with the characters immediately before and after it. For example, in the first row I want to remove ্ and also the o and t on either side of it.
My desired output looks like the following:
col1 col2
utto্tor uttor
dokkho্shin dokkhhin
muuns্si muuni
P.S. I got these characters by using the Avro parser, as shown below:
reversed_text = avro.reverse("উত্তর")
print(reversed_text)
output: utto্tor
col0 col1
উত্তর utto্tor
দক্ষিণ dokkho্shin
মুন্সী muuns্si
CodePudding user response:
You can use str.replace with a regular expression that removes every non-ASCII character together with the characters immediately before and after it:
df['col2'] = df['col1'].str.replace(r'.[^\x00-\x7F].', '', regex=True)
output:
col1 col2
0 utto্tor uttor
1 dokkho্shin dokkhhin
2 muuns্si muuni
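For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample column from the question, assuming the stray character is the Bengali hasanta sign (U+09CD), and writes it as an escape sequence so that it survives copy and paste. Note that the pattern needs one character on each side, so a stray character at the very start or end of a string would not be matched.

```python
import pandas as pd

# Sample data from the question; "\u09cd" is assumed to be the stray
# character (Bengali hasanta / virama sign).
df = pd.DataFrame({"col1": ["utto\u09cdtor", "dokkho\u09cdshin", "muuns\u09cdsi"]})

# Any one character, then one non-ASCII character, then any one character.
pattern = r".[^\x00-\x7F]."

# Replace each three-character match with an empty string.
df["col2"] = df["col1"].str.replace(pattern, "", regex=True)

print(df)
```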
CodePudding user response:
The pandas str accessor should provide the functionality you need: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html
Example:
import pandas as pd
df = pd.DataFrame({'Col1': ['Text1', 'Text2']})
df['Col1'] = df['Col1'].str.replace("Text", "newText")
df
It also supports regular expressions (pass regex=True to str.replace).
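As a rough illustration of the regular-expression form applied to the question's data, the sketch below targets only one specific character instead of all non-ASCII characters. It assumes the stray character is the Bengali hasanta sign (U+09CD); the names HASANTA, pattern, and cleaned are made up for this example. The neighbours are optional here (.?), which is a design choice so that a stray character at the start or end of a string is still removed.

```python
import pandas as pd

# Assumption: the stray character is the Bengali hasanta (virama), U+09CD.
HASANTA = "\u09cd"

# The last entry is an extra made-up case with the stray character at the start.
s = pd.Series(["utto\u09cdtor", "dokkho\u09cdshin", "muuns\u09cdsi", "\u09cdka"])

# Optional character before, the hasanta itself, optional character after.
pattern = f".?{HASANTA}.?"

cleaned = s.str.replace(pattern, "", regex=True)
print(cleaned.tolist())
```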