Parse big JSON file with font-encoding cp1252

I have to handle a big JSON file (approx. 47GB) and it seems as if I found the solution in ijson.

However, when I want to go through the objects I get the following error:

byggesag = (o for o in objects if o["h�ndelse"] == 'Byggesag')
                                                             ^
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xe6 in position 12: invalid continuation byte

Here is the code I am using so far:

import ijson

with open("C:/Path/To/Json/JSON_20220703180000.json", "r", encoding="cp1252") as json_file:
    objects = ijson.items(json_file, 'SagList.item')
    byggesag = (o for o in objects if o['hændelse'] == 'Byggesag')

How can I deal with the encoding of the input file?

CodePudding user response：

The problem is with the Python script itself: it is saved as cp1252, but Python expects source files to be UTF-8. You seem to be handling the input JSON file correctly, although you won't be able to confirm that until the script actually runs.

First, note that the error is a SyntaxError, which is raised while Python is parsing your script/module, before any of its code (including the open() call) runs.

Secondly, note how hændelse appears scrambled in the first bit of code you shared, and how Python complains that the 'utf-8' codec can't decode byte 0xe6. This is because the character æ (U+00E6, https://www.compart.com/de/unicode/U+00E6) is encoded as the single byte 0xe6 in cp1252. In UTF-8, 0xe6 can only start a three-byte sequence, and the n that follows it is not a valid continuation byte; hence the error.
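The byte-level mismatch can be reproduced without any files. This is only a minimal sketch: the names word, cp1252_bytes, utf8_bytes and try_decode_as_utf8 are made up for illustration, and the key is written with a \u escape so the snippet itself stays pure ASCII.

```python
# "h\u00e6ndelse" is the key from the question; \u00e6 is the letter U+00E6.
word = "h\u00e6ndelse"

cp1252_bytes = word.encode("cp1252")  # U+00E6 becomes the single byte 0xe6
utf8_bytes = word.encode("utf-8")     # U+00E6 becomes the two bytes 0xc3 0xa6


def try_decode_as_utf8(data):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)


# Decoding the cp1252 bytes as UTF-8 fails the same way the script does.
message = try_decode_as_utf8(cp1252_bytes)
print(cp1252_bytes)
print(utf8_bytes)
print(message)
```

The printed message should name byte 0xe6 and "invalid continuation byte", just like the error in the question, only with a different position.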

To solve it, save your Python script with UTF-8 encoding, or declare that it is saved as cp1252 with an encoding comment on the first or second line of the file (see https://peps.python.org/pep-0263/ for reference).
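The effect of the PEP 263 declaration can be checked in isolation. The sketch below is only an illustration under stated assumptions: body stands in for a hypothetical one-line script, and load_key is a made-up helper, not part of ijson or of the original code. It hands compile() the raw cp1252 bytes of that script, once without and once with the declaration.

```python
# Hypothetical one-line script containing the non-ASCII key from the question.
# The \u00e6 escape (U+00E6) keeps this demo itself pure ASCII.
body = "key = 'h\u00e6ndelse'\n"

# The same script as it would sit on disk when saved as cp1252,
# once without and once with a PEP 263 encoding declaration.
undeclared = body.encode("cp1252")
declared = ("# -*- coding: cp1252 -*-\n" + body).encode("cp1252")


def load_key(source_bytes):
    """Compile and run the script bytes; return `key`, or None if Python rejects the source."""
    try:
        code = compile(source_bytes, "<demo script>", "exec")
    except (SyntaxError, UnicodeDecodeError):
        # Without a declaration the bytes are read as UTF-8 and 0xe6 is rejected.
        return None
    namespace = {}
    exec(code, namespace)
    return namespace["key"]


print(load_key(undeclared))
print(load_key(declared) == "h\u00e6ndelse")
```

In the real script the equivalent change would be to put `# -*- coding: cp1252 -*-` on the first or second line. Re-saving the file as UTF-8 in your editor avoids the need for the declaration altogether.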