file.json
[{"id":1, "name":"Tim"}, {"id":2, "name":"Jim"}, {"id":3, "name":"Paul"}, {"id":4, "name":"Sam"}]
It's encoded as "UTF-8 with BOM".
When I use pandas, it works:
df = pd.read_json('file.json', encoding='utf-8-sig', orient='records')
Successful
When I use dask, it fails:
df = dd.read_json('file.json', encoding='utf-8-sig', orient='records')
ValueError: An error occurred while calling the read_json method registered to the pandas backend. Original Message: Expected object or value
I am trying to read the data into a dask DataFrame. The original message leads me to believe it's a parsing issue, but could this be a bug? Does dask not support the same encoding options as pandas?
CodePudding user response:
By default, dask.dataframe.read_json expects the raw data to be line-delimited JSON. This can be changed by passing lines=False as a kwarg. Here's an MRE:
data = [
    {"id": 1, "name": "Tim"},
    {"id": 2, "name": "Jim"},
    {"id": 3, "name": "Paul"},
    {"id": 4, "name": "Sam"},
]

from json import dumps

# Write the records as a single JSON array, encoded as UTF-8 with BOM
with open("file.json", "w", encoding="utf-8-sig") as f:
    f.write(dumps(data))

from dask.dataframe import read_json

# lines=False: parse the file as one JSON document, not one record per line
df = read_json("file.json", encoding="utf-8-sig", lines=False)
print(df.compute())
#    id  name
# 0   1   Tim
# 1   2   Jim
# 2   3  Paul
# 3   4   Sam
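
For reference, here is a rough sketch of how the two layouts differ. It uses only the standard-library json module, so it does not show what dask does internally, and the variable names are just for illustration:

```python
import json

records = [
    {"id": 1, "name": "Tim"},
    {"id": 2, "name": "Jim"},
    {"id": 3, "name": "Paul"},
    {"id": 4, "name": "Sam"},
]

# Layout used in the question: one JSON array holding every record.
as_array = json.dumps(records)

# Line-delimited layout: one standalone JSON object per line.
as_lines = "\n".join(json.dumps(record) for record in records)

# Each line of the line-delimited text parses on its own.
parsed_lines = [json.loads(line) for line in as_lines.splitlines()]

# The array form parses as a single document.
parsed_array = json.loads(as_array)

print(as_array)
print(as_lines)
```

Both forms carry the same records. The file in the question is the array form, which is why the reader needs to be told not to expect one record per line.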