I have a service that reads from MongoDB and needs to dump all the records with the same metadata_id into a local temp file. Is there a way to optimize/speed up the bson.json_util dumping portion? The querying part, where everything is loaded into the cursor, always takes less than 30 seconds for hundreds of MBs, but the dumping part then takes around 1 hour.
It took 3 days to archive ~0.2 TB of data.
import gzip
import logging

from bson import ObjectId
from bson.json_util import dumps


def dump_snapshot_to_local_file(mongo, database, collection, metadata_id, file_path, dry_run=False):
    """
    Creates a gz archive for all documents with the same metadata_id
    """
    cursor = mongo_find(mongo.client[database][collection], match={"metadata_id": ObjectId(metadata_id)})
    path = file_path + '/' + database + '/' + collection + '/'
    create_directory(path)
    path = path + metadata_id + '.json.gz'
    ok = False
    try:
        with gzip.open(path, 'wb') as file:
            logging.info("Saving to temp location %s", path)
            # wrap everything in a single JSON object: {"documents": [...]}
            file.write(b'{"documents":[')
            for document in cursor:
                if ok:
                    file.write(b',')  # comma-separate every document after the first
                ok = True
                file.write(dumps(document).encode())
            file.write(b']}')
    except IOError as e:
        logging.error("Failed exporting data with metadata_id %s to gz. Error: %s", metadata_id, e.strerror)
        return False

    if not is_gz_file(path):
        logging.error("Failed to create gzip file for data with metadata_id %s", metadata_id)
        return False

    logging.info("Data with metadata_id %s was successfully saved at temp location", metadata_id)
    return True
Would there be a better approach to doing this?
Any tips would be greatly appreciated.
CodePudding user response:
Since I wasn't using any of the JSONOptions functionality, and the service was spending most of its time in json_util's dumps, stepping away from it and dumping directly to BSON, without the JSON conversion, shaved about 35 minutes off the original 40-minute run (1.8 million documents, ~3.5 GB).
import bson  # PyMongo's bson package

try:
    with gzip.open(path, 'wb') as file:
        logging.info("Saving snapshot to temp location %s", path)
        for document in cursor:
            file.write(bson.BSON.encode(document))  # raw BSON, no JSON round-trip
except IOError as e:
    logging.error("Failed exporting data with metadata_id %s to gz. Error: %s", metadata_id, e.strerror)
    return False
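One thing to keep in mind with a raw-BSON dump is how it gets read back. That isn't covered in the post, but as a minimal sketch (the helper name iter_snapshot_documents and the example path below are made up), PyMongo's bson module can stream documents straight out of the gzipped file with bson.decode_file_iter, since every BSON document is length-prefixed:

import gzip

import bson  # PyMongo's bson package


def iter_snapshot_documents(path):
    """Yield documents one at a time from a gzipped raw-BSON dump."""
    with gzip.open(path, 'rb') as file:
        # decode_file_iter reads one length-prefixed BSON document at a time,
        # so the whole archive never has to fit in memory.
        for document in bson.decode_file_iter(file):
            yield document


# Hypothetical usage: count the documents restored from one snapshot file.
# count = sum(1 for _ in iter_snapshot_documents('/tmp/mydb/mycoll/<metadata_id>.bson.gz'))

Because decode_file_iter consumes the stream lazily, memory use stays flat regardless of archive size, which pairs well with writing the documents out one by one as above.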