Currently I have the following implementation for writing a list of dictionaries to a file. The file_limit_counter below is initialized to 0, while file_limit is initialized to, say, 50 for now. Whenever the counter reaches the file limit, the code starts writing the output into a new file; the counter is what breaks the output into multiple files:
self.file = self.generate_new_micro_file_name()
for response_dict in all_response_dict_list:
    if file_limit_counter == file_limit:
        self.propagate_log_msg('wrote {} records '.format(file_limit))
        # current file is full: switch to a new file and reset the counter
        self.file = self.generate_new_micro_file_name()
        file_limit_counter = 0
    # the current file is re-opened for every single record
    with open(self.file, 'a', encoding="utf8") as open_out_file:
        json.dump(response_dict, open_out_file)
        open_out_file.write('\n')
    file_limit_counter += 1
The list all_response_dict_list would contain something like this:
all_response_dict_list = [{"Name":"a1","Age":"24"},{"Name":"a2","Age":"26"}]
and my intention is to have something like this in my output file:
{"Name":"a1","Age":"24"}
{"Name":"a2","Age":"26"}
...
The above approach works fine, but with a large set of dictionaries, for example 5000, it slows down considerably (it takes approximately 10 minutes). So it would be helpful if someone has already come across this kind of scenario. I think it would take less time if the same thing could be done in parallel, i.e. writing multiple dictionaries into the same file at once rather than one by one.
CodePudding user response:
You could split your list into chunks of size file_limit and then write each chunk to its own file in a single call.
Try:
chunks = [all_response_dict_list[i: i + file_limit]
          for i in range(0, len(all_response_dict_list), file_limit)]
for chunk in chunks:
    with open(self.generate_new_micro_file_name(), 'w', encoding='utf8') as outfile:
        json.dump(chunk, outfile)
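Note that json.dump(chunk, outfile) writes each chunk as one JSON array. If you want one object per line in each file, as in the expected output above, a small variation of the same idea (a sketch, reusing generate_new_micro_file_name from the question) would be:
chunks = [all_response_dict_list[i: i + file_limit]
          for i in range(0, len(all_response_dict_list), file_limit)]
for chunk in chunks:
    with open(self.generate_new_micro_file_name(), 'w', encoding='utf8') as outfile:
        # serialize the whole chunk first, then write it with a single call
        outfile.write('\n'.join(json.dumps(d) for d in chunk) + '\n')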
CodePudding user response:
A slight variation on this theme that runs in <0.6s on my machine:-
import json
import time

start = time.perf_counter()

# 50,000 sample records to write
all_response_dict_list = [{'name': f'a{i}', 'age': i} for i in range(50_000)]

file_limit = 50
file_limit_counter = 0
fnum = 0

def generate_new_micro_file():
    # open (rather than just name) the next output file
    global fnum
    fnum += 1
    return open(f'base{fnum}.json', 'a', encoding='utf8')

open_out_file = generate_new_micro_file()

for response_dict in all_response_dict_list:
    if file_limit_counter == file_limit:
        # current file is full: close it and move on to the next one
        open_out_file.close()
        open_out_file = generate_new_micro_file()
        file_limit_counter = 0
    json.dump(response_dict, open_out_file)
    open_out_file.write('\n')
    file_limit_counter += 1

open_out_file.close()

print(f'Duration={time.perf_counter()-start:.2f}s')
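Plugging the same pattern back into the class from the question might look roughly like this (a sketch; propagate_log_msg and generate_new_micro_file_name are the asker's existing methods, used unchanged):
file_limit_counter = 0
open_out_file = open(self.generate_new_micro_file_name(), 'a', encoding='utf8')
for response_dict in all_response_dict_list:
    if file_limit_counter == file_limit:
        self.propagate_log_msg('wrote {} records '.format(file_limit))
        # close the full file and open the next one, instead of reopening per record
        open_out_file.close()
        open_out_file = open(self.generate_new_micro_file_name(), 'a', encoding='utf8')
        file_limit_counter = 0
    json.dump(response_dict, open_out_file)
    open_out_file.write('\n')
    file_limit_counter += 1
open_out_file.close()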
CodePudding user response:
This writes 50,000 records to 1,000 files in 1.1 seconds, by keeping each file open rather than doing an open/close for every one of its 50 records.
import json
import time

# Generate.
part1 = time.time()
all_response_dict_list = []
for i in range(50000):
    all_response_dict_list.append({'name': f'a{i}', 'age': str(i)})
part2 = time.time()
print(part2 - part1)

# Write.
file_limit = 50
file_limit_counter = 0
fnum = 0

def generate_new_micro_file_name():
    global fnum
    fnum += 1
    return f'base{fnum}.json'

open_out_file = open(generate_new_micro_file_name(), 'a', encoding='utf8')
for response_dict in all_response_dict_list:
    if file_limit_counter == file_limit:
        # 50 records written: switch to the next file
        open_out_file = open(generate_new_micro_file_name(), 'a', encoding='utf8')
        file_limit_counter = 0
    json.dump(response_dict, open_out_file)
    open_out_file.write('\n')
    file_limit_counter += 1
print(time.time() - part2)
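As for the parallel idea from the question: since every chunk ends up in its own file anyway, the simplest form of parallelism is to let each worker write one whole chunk to one file, rather than having several workers share a single file handle. Below is a minimal sketch with concurrent.futures (the part{n}.json naming scheme is hypothetical), though the timings above suggest the real bottleneck was the per-record open/close rather than the lack of parallelism:
import json
from concurrent.futures import ThreadPoolExecutor

def write_chunk(args):
    index, chunk = args
    # each worker owns its own file, so no locking is needed
    with open(f'part{index}.json', 'w', encoding='utf8') as out:
        out.write('\n'.join(json.dumps(d) for d in chunk) + '\n')

chunks = [all_response_dict_list[i: i + file_limit]
          for i in range(0, len(all_response_dict_list), file_limit)]
with ThreadPoolExecutor() as pool:
    # map runs the chunk writes concurrently; list() forces them to finish
    list(pool.map(write_chunk, enumerate(chunks)))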