Why does Pandas "utf-8-sig" encoding work but Dask doesn't?


file.json

[{"id":1, "name":"Tim"},
{"id":2, "name":"Jim"},
{"id":3, "name":"Paul"},
{"id":4, "name":"Sam"}]

It's encoded as "UTF-8 with BOM".

When I use pandas, it works:

import pandas as pd

df = pd.read_json('file.json',
    encoding='utf-8-sig',
    orient='records')

This succeeds.

When I use dask, it fails:

import dask.dataframe as dd

df = dd.read_json('file.json',
    encoding='utf-8-sig',
    orient='records')

ValueError: An error occurred while calling the read_json method registered to the pandas backend. Original Message: Expected object or value

I am trying to read the data into a Dask DataFrame. The original message leads me to believe it's a parsing issue, but could this be a bug? Does Dask not have the same encoding options as pandas?

CodePudding user response:

By default, dask.dataframe.read_json expects the raw data to be line-delimited JSON (lines=True is implied when orient="records"), so it tries to parse each line of the file as a standalone JSON document. Since your file is a pretty-printed array with records spread across lines, no single line is valid JSON on its own, which is why the parse fails with "Expected object or value". This can be changed by passing lines=False as a kwarg. Here's a minimal reproducible example:

from json import dumps

from dask.dataframe import read_json

data = [
    {"id": 1, "name": "Tim"},
    {"id": 2, "name": "Jim"},
    {"id": 3, "name": "Paul"},
    {"id": 4, "name": "Sam"},
]

# Write the records as a single JSON array with a leading BOM.
with open("file.json", "w", encoding="utf-8-sig") as f:
    f.write(dumps(data))

# lines=False makes dask parse the file as one JSON document
# rather than one JSON object per line.
df = read_json("file.json", encoding="utf-8-sig", lines=False)
print(df.compute())
#    id  name
# 0   1   Tim
# 1   2   Jim
# 2   3  Paul
# 3   4   Sam
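
For contrast, here is a sketch of the default line-delimited path, reusing the data and imports from the example above (the file name file.jsonl is just illustrative):

# With the default lines=True (implied by orient="records"), dask expects
# one JSON object per line, so write the records as JSON Lines instead.
with open("file.jsonl", "w", encoding="utf-8-sig") as f:
    f.write("\n".join(dumps(record) for record in data))

df = read_json("file.jsonl", encoding="utf-8-sig")  # lines defaults to True here
print(df.compute())  # same four rows as above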