Unicode error when reading a csv file with pandas

Time:09-27

Why is pandas unable to read this CSV file, raising a 'UnicodeEncodeError' instead? I tried many solutions from Stack Overflow (local download, different encodings, changing the engine...), but none of them work. How can I fix it?

import pandas as pd
url = 'http://data.insideairbnb.com/france/pyrénées-atlantiques/pays-basque/2022-06-10/data/listings.csv.gz'

pd.read_csv(url,encoding='utf-8')

CodePudding user response:

TL;DR

Your URL contains non-ASCII characters, as the error message says.

Just change:

url = 'http://data.insideairbnb.com/france/pyrénées-atlantiques/pays-basque/2022-06-10/data/listings.csv.gz'

To its percent-encoded form:

url = 'http://data.insideairbnb.com/france/pyr%C3%A9n%C3%A9es-atlantiques/pays-basque/2022-06-10/data/listings.csv.gz'

And the problem is fixed.

Solutions

Automatic URL escaping

Reading the traceback closely shows that, when read_csv requests the resource behind the URL, it expects the URL to be ASCII-encodable, which is not the case for this specific resource.

The underlying call made by read_csv fails with the same error:

import urllib.request
urllib.request.urlopen(url)
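The root cause can be reproduced without any network access. The following is a minimal sketch, assuming the failure comes from the URL being encoded as ASCII somewhere along the request path; the variable name offending is only illustrative:

```python
url = 'http://data.insideairbnb.com/france/pyrénées-atlantiques/pays-basque/2022-06-10/data/listings.csv.gz'

try:
    # Assumption: this mirrors the ASCII encoding step that fails inside the HTTP request.
    url.encode('ascii')
    offending = None
except UnicodeEncodeError as error:
    # error.start and error.end delimit the first run of characters that cannot be encoded.
    offending = url[error.start:error.end]

print(offending)
```

If offending is not None, the URL needs escaping before it is handed to urlopen or read_csv.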

The problem is caused by the accented characters in pyrénées, which must be percent-escaped to prevent urlopen from failing. Below is a clean way to enforce this requirement:

import urllib.parse

result = urllib.parse.urlparse(url)
replaced = result._replace(path=urllib.parse.quote(result.path))
url = urllib.parse.urlunparse(replaced)

pd.read_csv(url)
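The escaping step above can be wrapped in a small reusable function. This is only a sketch: escape_url_path is a hypothetical helper name, and keeping '%' in the safe set is an assumption made here so that an already-escaped URL is not encoded a second time.

```python
import urllib.parse


def escape_url_path(url):
    """Percent-encode non-ASCII characters in the path of url (hypothetical helper)."""
    parts = urllib.parse.urlparse(url)
    # '%' is kept in `safe` so existing escape sequences are left untouched
    # and calling the helper twice does not double-encode the path.
    quoted_path = urllib.parse.quote(parts.path, safe='/%')
    return urllib.parse.urlunparse(parts._replace(path=quoted_path))


url = 'http://data.insideairbnb.com/france/pyrénées-atlantiques/pays-basque/2022-06-10/data/listings.csv.gz'
escaped = escape_url_path(url)
print(escaped)
```

The returned string can be passed straight to pd.read_csv. Note that with this choice of safe characters, a literal '%' in a path that is not part of an escape sequence would be left as is.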

Handling dataflow by yourself

Alternatively, you can bypass this limitation by handling the complete flow yourself. The following snippet does the trick:

import io
import gzip
import pandas as pd
import requests

url = 'http://data.insideairbnb.com/france/pyrénées-atlantiques/pays-basque/2022-06-10/data/listings.csv.gz'
response = requests.get(url)
file = io.BytesIO(response.content)
with gzip.open(file, 'rb') as handler:
    df = pd.read_csv(handler)

The key is to fetch the HTTP resource, decompress it, and then present the content as a file-like object, because read_csv does not read raw CSV strings directly.
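The file-like object part can be tried offline. In this sketch the CSV content is made-up sample data, and payload merely stands in for response.content from the snippet above:

```python
import gzip
import io

import pandas as pd

# Made-up sample data standing in for the downloaded listings.csv.gz payload.
raw_csv = "id,name\n1,Biarritz\n2,Bayonne\n"
payload = gzip.compress(raw_csv.encode("utf-8"))  # plays the role of response.content

# Wrap the compressed bytes in a file-like object, then decompress while reading.
with gzip.open(io.BytesIO(payload), "rb") as handler:
    df = pd.read_csv(handler)

# pandas can also decompress a binary buffer itself when told which compression to use.
df_direct = pd.read_csv(io.BytesIO(payload), compression="gzip")

print(df)
```

Both calls should yield the same DataFrame; the second form simply lets pandas handle the gzip step.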
