I have a service that reads from MongoDB and needs to dump all the records with the same metadata_id into a local temp file. Is there a way to optimize/speed up the bson.json_util dumping portion? The querying part, where everything is loaded into the cursor, always takes less than 30 seconds for hundreds of MBs, but the dumping part then takes around 1 hour.
It took 3 days to archive ~0.2 TB of data.
import gzip
import logging

from bson import ObjectId
from bson.json_util import dumps


def dump_snapshot_to_local_file(mongo, database, collection, metadata_id, file_path, dry_run=False):
    """
    Creates a gz archive for all documents with the same metadata_id
    """
    cursor = mongo_find(mongo.client[database][collection], match={"metadata_id": ObjectId(metadata_id)})
    path = file_path + '/' + database + '/' + collection + '/'
    create_directory(path)
    path = path + metadata_id + '.json.gz'
    ok = False
    try:
        with gzip.open(path, 'wb') as file:
            logging.info("Saving to temp location %s", path)
            # wrap everything in a single JSON object: {"documents": [...]}
            file.write(b'{"documents":[')
            for document in cursor:
                if ok:
                    file.write(b',')  # comma-separate every document after the first
                ok = True
                file.write(dumps(document).encode())
            file.write(b']}')
    except IOError as e:
        logging.error("Failed exporting data with metadata_id %s to gz. Error: %s", metadata_id, e.strerror)
        return False

    if not is_gz_file(path):
        logging.error("Failed to create gzip file for data with metadata_id %s", metadata_id)
        return False

    logging.info("Data with metadata_id %s was successfully saved at temp location", metadata_id)
    return True
Would there be a better approach to doing this?
Any tips would be greatly appreciated.
CodePudding user response:
Since I wasn't using any of the JSONOptions functionality, and the service was spending most of its time in json_util's dumps, stepping away from it and dumping directly to BSON, without the JSON conversion, shaved about 35 minutes off the original 40-minute run (1.8 million documents, ~3.5 GB).
import bson  # PyMongo's bson package

try:
    with gzip.open(path, 'wb') as file:
        logging.info("Saving snapshot to temp location %s", path)
        for document in cursor:
            file.write(bson.BSON.encode(document))  # raw BSON, no JSON round-trip
except IOError as e:
    logging.error("Failed exporting data with metadata_id %s to gz. Error: %s", metadata_id, e.strerror)
    return False
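One thing to keep in mind with a raw-BSON dump is how it gets read back. That isn't covered in the post, but as a minimal sketch (the helper name iter_snapshot_documents and the example path below are made up), PyMongo's bson module can stream documents straight out of the gzipped file with bson.decode_file_iter, since every BSON document is length-prefixed:

import gzip

import bson  # PyMongo's bson package


def iter_snapshot_documents(path):
    """Yield documents one at a time from a gzipped raw-BSON dump."""
    with gzip.open(path, 'rb') as file:
        # decode_file_iter reads one length-prefixed BSON document at a time,
        # so the whole archive never has to fit in memory.
        for document in bson.decode_file_iter(file):
            yield document


# Hypothetical usage: count the documents restored from one snapshot file.
# count = sum(1 for _ in iter_snapshot_documents('/tmp/mydb/mycoll/<metadata_id>.bson.gz'))

Because decode_file_iter consumes the stream lazily, memory use stays flat regardless of archive size, which pairs well with writing the documents out one by one as above.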