Why is pandas unable to read this CSV file? It raises a 'UnicodeEncodeError'. I tried many solutions from Stack Overflow (downloading the file locally, different encodings, changing the engine...), but none of them work. How can I fix it?
import pandas as pd
url = 'http://data.insideairbnb.com/france/pyrénées-atlantiques/pays-basque/2022-06-10/data/listings.csv.gz'
pd.read_csv(url,encoding='utf-8')
CodePudding user response:
TL;DR
Your URL contains non-ASCII characters, which is what the error complains about.
Just change:
url = 'http://data.insideairbnb.com/france/pyrénées-atlantiques/pays-basque/2022-06-10/data/listings.csv.gz'
To:
url = 'http://data.insideairbnb.com/france/pyr%C3%A9n%C3%A9es-atlantiques/pays-basque/2022-06-10/data/listings.csv.gz'
And the problem is fixed.
Solutions
Automatic URL escaping
Reading the traceback in depth shows that the failure happens while read_csv requests the resource behind the URL: the URL is expected to contain only ASCII characters, which is not the case for this specific resource.
This call, which read_csv makes under the hood, fails with the same error:
import urllib.request
urllib.request.urlopen(url)
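The root cause can be reproduced without any network access. The sketch below assumes the failure comes from the URL path being encoded as ASCII when the request is built; the shortened path is only an illustration, not the exact string the library handles.

```python
# Minimal reproduction: encoding a path with accents as ASCII fails.
# The shortened path is an illustrative assumption, not the real request line.
path = "/france/pyrénées-atlantiques/listings.csv.gz"

try:
    path.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The exception records which character could not be encoded.
print(type(error).__name__, repr(error.object[error.start]))
```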
The problem is caused by the accented characters in pyrénées, which must be percent-encoded to prevent urlopen from failing. Below is a clean way to enforce this requirement:
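import urllib.parse
# Split the URL so that only the path component gets percent-encoded
result = urllib.parse.urlparse(url)
replaced = result._replace(path=urllib.parse.quote(result.path))
# Rebuild the URL; it now contains only ASCII characters
url = urllib.parse.urlunparse(replaced)
pd.read_csv(url)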
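The same escaping step can be wrapped in a small reusable function. This is a minimal sketch: the name quote_url_path is hypothetical, and it assumes the input path is not already percent-encoded (quoting twice would turn each % into %25).

```python
import urllib.parse


def quote_url_path(url: str) -> str:
    """Percent-encode only the path of a URL (hypothetical helper)."""
    parts = urllib.parse.urlsplit(url)
    # quote() keeps '/' by default and encodes non-ASCII characters
    # as their UTF-8 bytes, e.g. 'é' becomes '%C3%A9'.
    quoted_path = urllib.parse.quote(parts.path)
    return urllib.parse.urlunsplit(parts._replace(path=quoted_path))


url = "http://data.insideairbnb.com/france/pyrénées-atlantiques/pays-basque/2022-06-10/data/listings.csv.gz"
escaped = quote_url_path(url)
print(escaped)
```

Only the path is quoted, so the scheme and host are left untouched, and the result can be passed directly to pd.read_csv.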
Handling the data flow yourself
Alternatively, you can bypass this limitation by handling the complete flow yourself. The following snippet does the trick:
import io
import gzip
import pandas as pd
import requests
url = 'http://data.insideairbnb.com/france/pyrénées-atlantiques/pays-basque/2022-06-10/data/listings.csv.gz'
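response = requests.get(url)
response.raise_for_status()  # fail early on HTTP errors
# Wrap the downloaded bytes in a file-like object
file = io.BytesIO(response.content)
# Decompress the gzip stream and let pandas parse the CSV
with gzip.open(file, 'rb') as handler:
    df = pd.read_csv(handler)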
The key is to fetch the HTTP resource, wrap its bytes in a file-like object and decompress it with gzip, because read_csv expects a path or a file-like object rather than a raw CSV string.
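The wrap-and-decompress step can be tried offline. The sketch below is an illustration under stated assumptions: the sample rows are made up and stand in for response.content, and the standard library csv module stands in for pd.read_csv so that the example runs without pandas or a network connection.

```python
import csv
import gzip
import io

# Made-up sample data standing in for the bytes of response.content
raw_csv = "id,name\n1,Biarritz\n2,Bayonne\n"
payload = gzip.compress(raw_csv.encode("utf-8"))

# Wrap the compressed bytes in a file-like object, then decompress on read
buffer = io.BytesIO(payload)
with gzip.open(buffer, "rt", encoding="utf-8", newline="") as handler:
    rows = list(csv.DictReader(handler))

print(rows)
```

With pandas, the handler would be opened in 'rb' mode and passed to pd.read_csv, as in the snippet above.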