Regex for unwanted characters


Hi everyone!

I'm trying to save a pandas data frame to PostgreSQL, but I'm getting many encoding errors due to some "Latin 1" characters. So I tried to replace those characters using the following:

df = df.replace(r'\u2019|\u2013', ' ', regex=True)

Although it's working, I would like a better way, since I don't know how many of those characters are in the data frame. I noticed that they all begin with \u2, so I tried using the code shown below:

df = df.replace(r'\\u[0-9]', ' ', regex=True)

The latter way doesn't work. Can you give me some tips on how to solve this problem?

Regards,

Marcio

CodePudding user response:

Use UTF-8 encoding when reading from that file.

df = pd.read_csv('something.csv', encoding='utf8')
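
If the source file is the real problem, the usual fix is to decode it correctly on the way in and let the database driver handle the encoding on the way out. Here is a minimal sketch of that round trip; the file name comes from the line above, while the table name and connection string are placeholders you would swap for your own:

import pandas as pd
from sqlalchemy import create_engine

# Decode the CSV as UTF-8 so characters such as \u2019 and \u2013 survive intact
df = pd.read_csv('something.csv', encoding='utf8')

# Placeholder connection string; needs a PostgreSQL driver such as psycopg2 installed
engine = create_engine('postgresql://user:password@localhost:5432/mydb')

# Write the frame to a placeholder table and let SQLAlchemy handle the encoding
df.to_sql('my_table', engine, if_exists='replace', index=False)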

CodePudding user response:

Could you rely on a whitelist of good characters, using a negated character class to replace everything else?

df = df.replace(r'[^a-zA-Z0-9 ]', ' ', regex=True)
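
For example, on a small throwaway frame (the column name and sample values below are made up for illustration), the smart quote and the dash both collapse to spaces:

import pandas as pd

# Made-up sample data containing a right single quote (\u2019) and an en dash (\u2013)
df = pd.DataFrame({'text': ['It\u2019s fine', 'range \u2013 test']})

# Replace everything that is not an ASCII letter, digit or space
df = df.replace(r'[^a-zA-Z0-9 ]', ' ', regex=True)

print(df['text'].tolist())  # ['It s fine', 'range   test']

Note that this also strips legitimate punctuation, so it only suits data where losing it is acceptable.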