I am trying to open some huge JSON files:
import json

papers0 = []
papers1 = []
papers2 = []
papers3 = []
papers4 = []
papers5 = []
papers6 = []
papers7 = []

for x in range(8):
    for line in open(f'part_00{x}.json', 'r'):
        globals()['papers%s' % x].append(json.loads(line))
However, the process above is slow. I wonder if there is some parallelization trick or some other way to speed it up.
Thank you
CodePudding user response:
If the JSON files are very large then loading them (as Python dictionaries) will be I/O bound. Therefore, multithreading would be appropriate for parallelisation.
Rather than having a discrete variable for each file, why not have a single dictionary keyed on the numeric part of each filename?
For example:
from concurrent.futures import ThreadPoolExecutor as TPE
from json import load as LOAD
from sys import stderr as STDERR

NFILES = 8
JDATA = {}

def get_json(n):
    # Parse one file and return (index, parsed data) so the results
    # can be collected into a dictionary keyed on the file number.
    try:
        with open(f'part_00{n}.json') as j:
            return n, LOAD(j)
    except Exception as e:
        print(e, file=STDERR)
        return n, None

def main():
    global JDATA  # bind to the module-level dictionary, not a new local one
    with TPE() as tpe:
        JDATA = dict(tpe.map(get_json, range(NFILES)))

if __name__ == '__main__':
    main()
After running this, the dictionary representation of the JSON file part_005.json (for example) would be accessible as JDATA[5].
Note that if an exception arises while opening or parsing any of the files, the relevant dictionary value will be None.
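One thing to check: the loop in the question calls json.loads on each line, which suggests the files may be in JSON Lines format (one JSON object per line) rather than a single JSON document each. If so, json.load on the whole file would fail. Below is a minimal sketch of a line-by-line variant under that assumption; the name get_json_lines is just illustrative, and it could be passed to tpe.map in place of get_json.
from json import loads
from sys import stderr as STDERR

def get_json_lines(n):
    # Hypothetical helper, assuming each non-blank line of the file
    # is a separate JSON object (JSON Lines format).
    try:
        with open(f'part_00{n}.json') as j:
            return n, [loads(line) for line in j if line.strip()]
    except Exception as e:
        print(e, file=STDERR)
        return n, None
With that variant, JDATA[n] would be a list of parsed objects, much like the papersN lists in the question.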