How to upload objects larger than 5 TB to Google Cloud Storage?

I am trying to save a PostgreSQL backup (~20 TB) to Google Cloud Storage for long-term storage, and I am currently piping the output of the pg_dump command into a streaming transfer through gsutil.

pg_dump -d $DB_NAME -b --format=t \
    | gsutil cp - gs://$BUCKET_NAME/$BACKUP_FILE

However, I am worried that the transfer will fail because of GCS's 5 TB object size limit.

Is there any way to upload objects larger than 5 TB to Google Cloud Storage?

EDIT: using split?

I am considering piping pg_dump into Linux's split utility and then uploading each chunk with gsutil cp.

# split's --filter hands each 50 GB chunk to its own gsutil upload
pg_dump -d $DB -b --format=t \
    | split -b 50G --filter="gsutil cp - gs://$BUCKET/\$FILE" \
        - "${BACKUP}_part_"

Would something like that work?

CodePudding user response:

As mentioned by Ferregina Pelona, guillaume blaquiere, and John Hanley, there is no way to bypass the 5 TB limit imposed by Google, as stated in this document:

Cloud Storage 5TB object size limit

Cloud Storage supports a maximum single-object size up to 5 terabytes. If you have objects larger than 5TB, the object transfer fails for those objects for either Cloud Storage or Transfer for on-premises.

If the file surpasses the limit (5 TB), the transfer fails.

You can use Google's issue tracker to request this feature; at the link provided, you can check which features have already been requested or file a new request that meets your needs.

CodePudding user response:

You generally don't want to upload a single object in the multi-terabyte range as a streaming transfer. Streaming transfers have two major downsides, and they're both very bad news for you:

  1. Streaming Transfers don't use Cloud Storage's checksum support. You'll get regular HTTP data integrity checking, but that's it, and for periodic 5 TB uploads, there's a nonzero chance that this could eventually end up in a corrupt backup.
  2. Streaming Transfers can't be resumed if they fail. Assuming you're uploading at 100 Mbps around the clock, a 5 TB upload would take at least 4 and a half days, and if your HTTP connection failed, you'd need to start over from scratch.

Instead, here's what I would suggest:

  1. First, minimize the file size. pg_dump has a number of options for reducing the file size. It's possible something like "--format=c -Z9" might produce a much smaller file.
  2. Second, if possible, store the dump as a file (or, preferably, a series of split-up files) before uploading; see the sketches after this list. This is good because you'll be able to calculate their checksums, which gsutil can take advantage of, and you'll also be able to manually verify that the files uploaded correctly if you want. Of course, this may not be practical because you'll need a spare 5 TB of hard drive space, but unless your database won't be changing for a few days, there may not be an easy way to retry if you lose your connection.
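
Putting those two suggestions together, a rough sketch might look like the following. The 50 GB chunk size and the backup_part_ prefix are arbitrary choices, and $DB_NAME/$BUCKET_NAME are the variables from the question; adjust everything to your environment.

# Dump with the custom format at maximum compression, writing the output
# as 50 GB chunks on local disk instead of streaming the upload.
pg_dump -d $DB_NAME -b --format=c -Z9 \
    | split -b 50G - backup_part_

# Upload the chunks in parallel. A non-streaming copy lets gsutil compute
# checksums and validate each object after the upload, and a failed chunk
# can simply be copied again.
gsutil -m cp backup_part_* gs://$BUCKET_NAME/

Each chunk stays well under the 5 TB object limit, and restoring is a matter of downloading the parts, concatenating them with cat, and feeding the result to pg_restore.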
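
For the manual verification mentioned in point 2, one option is to compare the CRC32C hash of a local chunk with the hash Cloud Storage reports for the uploaded object (again using the placeholder names above):

# CRC32C of the local file...
gsutil hash -c backup_part_aa

# ...versus the hash recorded for the object in the bucket.
gsutil stat gs://$BUCKET_NAME/backup_part_aa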