Loading huge amounts of JSON files, Error

Time:06-08

I'm new to programming and just started 7 days ago; I hope my question is not that stupid. I do not want to be the "Can someone code me this..." guy. Naming a keyword or a method I could use to search further myself would already help me a lot.

I have a huge number (500,000-1,000,000) of JSON files in a directory, all with the same format. My goal is to load the files and write some values into a CSV file. I already started writing the code, but when I run it, it stops after ~80 files and gives the following error:

Traceback (most recent call last):
  File "C:\Users\Hauke\PycharmProjects\Json Merge\CSV.py", line 14, in <module>
    data = json.load(g)
  File "C:\Users\Hauke\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\Hauke\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 8934: character maps to <undefined>

I do not really understand why. I guess it is because the program cannot load so many JSON files? I know that the "if" part is really a mess; if it has no impact on my problem, you can ignore that part.

My Code:

import json
import os
import csv
import time

start = time.time()

for path, dirs, files in os.walk("Json"):
    for f in files:
        fileName = os.path.join(path, f)
        print(fileName)

        with open(fileName, "r") as g:
            data = json.load(g)


            if data.get("Retweeted") == True:
                name1 = data.get("ScreenName")
                rtstatus = data.get("RetweetedStatus")
                rtent = rtstatus.get("User")
                name = rtent.get("ScreenNameResponse")
                fields = [name1, name]
                with open("Import.csv", "a", newline="") as t:
                    writer = csv.writer(t)
                    writer.writerow(fields)

            if data.get("InReplyToScreenName") != "null":
                name2 = data.get("ScreenName")
                RPName = data.get("InReplyToScreenName")
                fields2 = [name2, RPName]
                with open("Import.csv", "a", newline="") as u:
                    writer2 = csv.writer(u)
                    writer2.writerow(fields2)


end = time.time()
print('Time taken for fun program: ', end - start)

Thanks for your help, I hope I'm not too stupid.

CodePudding user response:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 8934: character maps to <undefined> is a decoding issue: on Windows, open() defaults to the cp1252 codec, and one of your files contains a byte that cp1252 cannot decode. You can either ignore the undecodable bytes or find and fix the offending part, e.g.:

 with open(fileName, "rb") as g:
     # read the raw bytes, then decode them yourself,
     # skipping any bytes that are not valid UTF-8
     raw = g.read()
     data_str = raw.decode("utf-8", errors="ignore")
     data = json.loads(data_str)
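Alternatively, a minimal sketch (with a hypothetical sample file written first so it is self-contained): you can pass the encoding and error handling straight to open(), so the file object decodes as UTF-8 without a separate read/decode step:

```python
import json

# Hypothetical sample data, shaped like a tweet from the question.
sample = {"ScreenName": "alice", "Retweeted": False}
with open("sample.json", "w", encoding="utf-8") as out:
    json.dump(sample, out)

# Decode as UTF-8 and silently drop any invalid bytes, instead of
# letting open() fall back to the platform default (cp1252 on Windows).
with open("sample.json", "r", encoding="utf-8", errors="ignore") as g:
    data = json.load(g)

print(data["ScreenName"])
```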

CodePudding user response:

json.load is not efficient for loading a huge JSON file, since it reads the whole file into memory before parsing.

You should use a "streaming" approach to parse a huge JSON file: parse it one chunk at a time instead of the whole file at once. The simplest version works when each line of the file is a complete JSON document (JSON Lines format):

import json

with open(filename, 'r') as f:
    for line in f:
        data = json.loads(line)
  • you can also use the ujson library; it's basically a drop-in replacement for the json library you know, but faster (written in C).
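Putting the pieces together for the original many-files use case, here is a sketch assuming the same field names as the question's code (it creates a tiny hypothetical "Json" directory first so it runs standalone). Opening the CSV once, instead of re-opening it for every tweet, also speeds things up considerably:

```python
import csv
import json
import os

# Hypothetical setup: one sample file shaped like the question's tweets.
os.makedirs("Json", exist_ok=True)
tweet = {
    "Retweeted": True,
    "ScreenName": "alice",
    "RetweetedStatus": {"User": {"ScreenNameResponse": "bob"}},
}
with open(os.path.join("Json", "0001.json"), "w", encoding="utf-8") as out:
    json.dump(tweet, out)

rows = 0
# Open the CSV once, outside the loop over half a million files.
with open("Import.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    for path, dirs, files in os.walk("Json"):
        for f in files:
            file_name = os.path.join(path, f)
            # Read as UTF-8 so Windows does not fall back to cp1252.
            with open(file_name, "r", encoding="utf-8",
                      errors="ignore") as g:
                data = json.load(g)
            if data.get("Retweeted"):
                user = data.get("RetweetedStatus", {}).get("User", {})
                writer.writerow([data.get("ScreenName"),
                                 user.get("ScreenNameResponse")])
                rows += 1

print(rows)
```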