Unable to remove box-looking character (non-ASCII) using regex in Python

A column in my DataFrame contains an odd, box-looking character before the word 'info', and I want to remove it. So far I have tried removing it with a method that strips non-ASCII characters, but that does not seem to work. Please help.

The code I have tried is:

df['column_name']=df['column_name'].apply(lambda x : re.sub(r'[^\x00-\x7F]', '', x))

and

df['column_name']=df['column_name'].replace(r'[^\x00-\x7F]', '')

but neither attempt works.

CodePudding user response：

Vectorize your function before applying it:

import re
import numpy as np

def removeNonAscii(s):
    return re.sub(r'[^\x00-\x7f]', "", s)

df['column_name'] = df['column_name'].apply(np.vectorize(removeNonAscii))
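
A plain-Python sketch of the same idea, in case it helps to test the pattern outside pandas and to see which character is actually in the cell. The names NON_ASCII, remove_non_ascii and odd_code_points are made up for this example, the sample value is hypothetical, and the guess that the column may hold non-string values such as None or NaN is an assumption, not something the question states.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def remove_non_ascii(value):
    """Return value without non-ASCII characters; leave non-strings alone."""
    if not isinstance(value, str):
        # Assumption: missing values (None, NaN) should pass through
        # unchanged, since re.sub raises TypeError on non-strings.
        return value
    return NON_ASCII.sub('', value)

def odd_code_points(text):
    """List the code points of characters that are not printable ASCII."""
    return [hex(ord(ch)) for ch in text if not (0x20 <= ord(ch) <= 0x7E)]

# Hypothetical cell value: a box glyph (U+25A1) before the word 'info'.
sample = '\u25a1info'
print(odd_code_points(sample))    # which unusual characters are present
print(remove_non_ascii(sample))   # the cleaned string
print(remove_non_ascii(None))     # non-strings are returned as-is
```

With a DataFrame this could be used as df['column_name'].apply(remove_non_ascii). If odd_code_points reports a value below 0x80 (a control character, for instance), the non-ASCII pattern will not match it and the character class would have to be widened.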

CodePudding user response：

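Pass regex=True so the pattern is treated as a regular expression, and add + after the character class so that a run of consecutive non-ASCII characters is replaced in one match. Assigning the result back to the column is more dependable than inplace=True on a selected column, which newer pandas versions may not propagate to df.

import pandas as pd

df = pd.DataFrame(["aÀnÑ,!?'\\"], columns=["column_name"])
df['column_name'] = df['column_name'].replace(r'[^\x00-\x7F]+', '', regex=True)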

print(df)

Output

  column_name
0     an,!?'\
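
To see what the + quantifier changes, here is a small standard-library sketch using re.subn, which returns the number of substitutions alongside the result. The sample string is invented for the illustration (two adjacent non-ASCII characters); the cleaned text is the same either way, only the number of matches differs.

```python
import re

# Hypothetical sample: 'À' (U+00C0) and 'Ñ' (U+00D1) sit next to each other.
text = "a\u00c0\u00d1n,!?'\\"

# Without '+', each non-ASCII character is replaced as a separate match.
single, n_single = re.subn(r'[^\x00-\x7F]', '', text)

# With '+', the whole run of adjacent non-ASCII characters is one match.
run, n_run = re.subn(r'[^\x00-\x7F]+', '', text)

print(single, n_single)
print(run, n_run)
```

Both calls leave only the ASCII characters; the version with + simply does it in fewer replacements when non-ASCII characters are adjacent.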