I am trying to open some huge JSON files:
import json

papers0 = []
papers1 = []
papers2 = []
papers3 = []
papers4 = []
papers5 = []
papers6 = []
papers7 = []

for x in range(8):
    for line in open(f'part_00{x}.json', 'r'):
        globals()['papers%s' % x].append(json.loads(line))
However, the process above is slow. I wonder if there is some parallelization trick or some other way to speed it up.
Thank you
CodePudding user response:
If the JSON files are very large then loading them (as Python dictionaries) will be I/O bound. Therefore, multithreading would be appropriate for parallelisation.
Rather than having a discrete variable for each file, why not have a single dictionary keyed on the numeric part of each filename?
For example:
from concurrent.futures import ThreadPoolExecutor as TPE
from json import load as LOAD
from sys import stderr as STDERR

NFILES = 8
JDATA = {}

def get_json(n):
    # Parse one file and return (index, parsed data) so the results
    # can be collected into a dictionary keyed on the file number.
    try:
        with open(f'part_00{n}.json') as j:
            return n, LOAD(j)
    except Exception as e:
        print(e, file=STDERR)
        return n, None

def main():
    global JDATA  # bind to the module-level dictionary, not a new local one
    with TPE() as tpe:
        JDATA = dict(tpe.map(get_json, range(NFILES)))

if __name__ == '__main__':
    main()
After running this, the dictionary representation of the JSON file part_005.json (for example) would be accessible as JDATA[5].
Note that if an exception arises while opening or parsing any of the files, the relevant dictionary value will be None.
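One thing to check: the loop in the question calls json.loads on each line, which suggests the files may be in JSON Lines format (one JSON object per line) rather than a single JSON document each. If so, json.load on the whole file would fail. Below is a minimal sketch of a line-by-line variant under that assumption; the name get_json_lines is just illustrative, and it could be passed to tpe.map in place of get_json.
from json import loads
from sys import stderr as STDERR

def get_json_lines(n):
    # Hypothetical helper, assuming each non-blank line of the file
    # is a separate JSON object (JSON Lines format).
    try:
        with open(f'part_00{n}.json') as j:
            return n, [loads(line) for line in j if line.strip()]
    except Exception as e:
        print(e, file=STDERR)
        return n, None
With that variant, JDATA[n] would be a list of parsed objects, much like the papersN lists in the question.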