Hi everyone!
I'm trying to save a pandas DataFrame to PostgreSQL, but I'm getting many encoding errors caused by some "Latin-1" characters. So I tried to replace those characters using the following:
df = df.replace(r'\u2019 |\u2013', ' ', regex=True)
Although it works, I would like a better approach, since I don't know how many of those characters are in the DataFrame. I noticed that they all begin with \u2, so I tried the code shown below:
df = df.replace(r'\\u[0-9]', ' ', regex=True)
The latter doesn't work. Can you give me some tips on how to solve this problem?
Regards,
Marcio
CodePudding user response:
Use UTF-8 encoding when reading from the file:
df = pd.read_csv('something.csv', encoding='utf8')
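If the errors only show up when writing to PostgreSQL rather than when reading, it may also help to force the database connection itself to UTF-8. A minimal sketch using SQLAlchemy with psycopg2 (the connection string, file name, and table name here are placeholders, not details from the original question):

import pandas as pd
from sqlalchemy import create_engine

# Read the source file as UTF-8 so characters like \u2019 survive intact.
df = pd.read_csv('something.csv', encoding='utf8')

# Placeholder connection string; force the client encoding to UTF-8 as well.
engine = create_engine(
    'postgresql+psycopg2://user:password@localhost:5432/mydb',
    client_encoding='utf8',
)

# Write the DataFrame without having to strip any characters.
df.to_sql('my_table', engine, if_exists='replace', index=False)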
CodePudding user response:
Could you rely on a whitelist of (negated) good characters?
df = df.replace('[^a-zA-Z0-9 ]', ' ', regex=True)
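Alternatively, if you'd rather target only the problematic characters instead of whitelisting everything else, you could match the Unicode range you noticed (code points starting with \u2). A rough sketch, assuming the offending characters all fall in the \u2000-\u2FFF block:

import pandas as pd

# Small example frame containing a curly apostrophe (\u2019) and an en dash (\u2013).
df = pd.DataFrame({'text': ['it\u2019s fine', 'a \u2013 b']})

# Replace any character in the \u2000-\u2FFF block (curly quotes, dashes, etc.)
# with a plain space; all other characters are left untouched.
df = df.replace(r'[\u2000-\u2fff]', ' ', regex=True)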