What does "partitioned data" mean in S3?


I want to use Netflix's outputCommitter (using Spark with Amazon EMR). In the README there are two options:

  1. S3DirectoryOutputCommitter - for writing unpartitioned data to S3 with conflict resolution.
  2. S3PartitionedOutputCommitter - for writing partitioned data to S3 with conflict resolution.

I tried to understand the difference, but without success. Can someone explain what "partitioned data" means in S3?

CodePudding user response:

According to the Hadoop docs: "This committer [is] an extension of the “Directory” committer which has a special conflict resolution policy designed to support operations which insert new data into a directory tree structured using Hive’s partitioning strategy: different levels of the tree represent different columns."
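
For example, writing a Spark DataFrame with partitionBy produces exactly that kind of tree, with one directory level per partition column. A minimal sketch (the bucket, table path, and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write-demo").getOrCreate()

df = spark.createDataFrame(
    [("2023", "01", 10.0), ("2023", "02", 12.5)],
    ["year", "month", "value"],
)

# Each distinct (year, month) pair becomes one directory level in S3:
#   s3://my-bucket/table/year=2023/month=01/part-...parquet
#   s3://my-bucket/table/year=2023/month=02/part-...parquet
df.write.partitionBy("year", "month").parquet("s3://my-bucket/table")
```

The partitioned committer's conflict resolution then applies per partition directory rather than to the whole destination tree.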

Search the Hadoop docs for the full details.

Be aware that the EMR committers are not the ASF S3A ones, so they take different config options and have their own docs. But since they are a reimplementation of the Netflix work, they should do the same thing here.
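
If it helps, here is a minimal sketch of switching a PySpark job to the ASF S3A partitioned committer; it assumes the hadoop-aws module (and Spark's cloud-integration bindings, for Parquet) is on the classpath, and the option names below are the S3A ones, not EMR's:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-partitioned-committer")
    # pick the "partitioned" committer instead of "directory"
    .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
    # per-partition conflict resolution: fail | append | replace
    .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
    .getOrCreate()
)
```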

CodePudding user response:

It is referring to multipart upload to S3.

This is a technique to split a big blob of data into smaller pieces, upload each piece separately, and, once done, send an API call to S3 saying that all parts are finished; S3 then combines the small blobs into a single object.

For more context about multipart:

Multipart is usually recommended for data bigger than 100 MB. A multipart upload can create objects up to 5 TB in size, and each part can be from 5 MiB to 5 GiB. You can handle each part as a separate file during the upload, and you can cancel and re-upload a part, which is what makes it possible to "pause"/"resume" an upload or retry a failed part. You can read more here:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
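
A minimal sketch of that start/upload/complete flow with boto3 (the bucket, key, and file names are placeholders, and read_chunks is just an illustrative helper):

```python
import boto3

def read_chunks(path, size=8 * 1024 * 1024):
    """Yield fixed-size chunks of a local file (8 MiB; every part except the last must be >= 5 MiB)."""
    with open(path, "rb") as f:
        while chunk := f.read(size):
            yield chunk

s3 = boto3.client("s3")
bucket, key = "my-bucket", "big/object.bin"  # placeholders

# 1. start the upload and get an upload id
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

parts = []
try:
    # 2. upload each piece separately, remembering its ETag
    for number, chunk in enumerate(read_chunks("big-file.bin"), start=1):
        resp = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=upload_id,
            PartNumber=number, Body=chunk,
        )
        parts.append({"PartNumber": number, "ETag": resp["ETag"]})

    # 3. tell S3 that all parts are done; S3 combines them into one object
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )
except Exception:
    # abort so the already-uploaded parts don't linger (and cost money)
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
    raise
```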

P.S. Make sure to set up a lifecycle rule on the S3 bucket to remove unfinished multipart uploads after some time. Otherwise you may end up with incomplete uploads where some parts were uploaded but never combined into a single object. E.g. if the upload of a 5 GB file starts, 3 GB of the 5 GB get uploaded, and then for whatever reason the process is terminated, those 3 GB will sit in S3 until you make an API call to remove the parts or combine them into the final object. I always add a lifecycle rule to remove unfinished multipart uploads after 7 or 14 days.

The rule does introduce a limit: you must finish the upload within that period or it will fail, so set a window that is "enough" for what you upload. Depending on network speed and file size, even 14 days may be far too little, and if your upload process allows pausing for a month, the window needs to account for that pause time as well. Without a lifecycle rule, parts stay in the bucket forever by default and you pay for that space; the upside is unlimited time to finish the upload.
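
For reference, one way to set such a rule with boto3 (the bucket name and the 7-day window are placeholders; the console or CLI can set the same rule):

```python
import boto3

s3 = boto3.client("s3")

# abort any multipart upload still incomplete 7 days after it started;
# S3 then deletes the orphaned parts so they stop accruing storage charges
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-stale-multipart-uploads",
                "Filter": {"Prefix": ""},  # whole bucket
                "Status": "Enabled",
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```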
