Speed up importing huge json files


I am trying to open up some huge JSON files:

import json

papers0 = []
papers1 = []
papers2 = []
papers3 = []
papers4 = []
papers5 = []
papers6 = []
papers7 = []

# Each file holds one JSON object per line; parse them into the matching list
for x in range(8):
    for line in open(f'part_00{x}.json', 'r'):
        globals()['papers%s' % x].append(json.loads(line))

However, the process above is slow. I wonder if there is some parallelization trick, or something else, that could speed it up.

Thank you

CodePudding user response:

If the JSON files are very large, then loading them (as Python dictionaries) will be I/O bound. Therefore, multithreading would be appropriate for parallelisation.

Rather than having discrete variables for each dictionary, why not have a single dictionary keyed on the significant numeric part of each filename?

For example:

from concurrent.futures import ThreadPoolExecutor as TPE
from json import load as LOAD
from sys import stderr as STDERR

NFILES = 8
JDATA = {}

def get_json(n):
    # Parse one file and return its index alongside the parsed data
    try:
        with open(f'part_00{n}.json') as j:
            return n, LOAD(j)
    except Exception as e:
        print(e, file=STDERR)
    return n, None

def main():
    global JDATA  # assign to the module-level dictionary, not a local one
    with TPE() as tpe:
        JDATA = dict(tpe.map(get_json, range(NFILES)))

if __name__ == '__main__':
    main()

After running this, the dictionary representation of the JSON file part_005.json (for example) would be accessible as JDATA[5].

Note that if an exception arises while opening or parsing any of the files, the relevant dictionary value will be None.
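
Continuing the script above, the per-file data could then be consumed like this (a minimal usage sketch; the None check mirrors the error handling in get_json):

# After main() has run, each file's parsed contents can be looked up
# by the numeric part of its filename.
papers5 = JDATA[5]  # contents of part_005.json, or None if it failed to load

for n, data in JDATA.items():
    if data is None:
        # This file failed to open or parse; get_json already printed the error
        continue
    print(f'part_00{n}.json loaded with {len(data)} top-level items')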

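One further note: the loop in the question parses each line of a file as its own JSON object (JSON Lines format), whereas json.load expects a single JSON document per file. If the files really are one-object-per-line, the worker function could be adapted along these lines (a sketch under that assumption; get_json_lines is a hypothetical drop-in replacement to pass to tpe.map):

from json import loads as LOADS
from sys import stderr as STDERR

def get_json_lines(n):
    # Return the file index and a list with one parsed object per line
    try:
        with open(f'part_00{n}.json') as j:
            return n, [LOADS(line) for line in j]
    except Exception as e:
        print(e, file=STDERR)
    return n, None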