Import data to Amazon AWS SageMaker from S3 or EC2


For an AI project I want to train a model on a dataset of about 300 GB. I want to use the AWS SageMaker framework.

In the SageMaker documentation, they write that SageMaker can import data from an AWS S3 bucket. Since the dataset is huge, I zipped it (into several zip files) and uploaded it to an S3 bucket, which took several hours. However, in order to use it I need to unzip the dataset. There are several options:

  1. Unzip directly in S3. This might be impossible to do. See refs below.
  2. Upload the uncompressed data directly. I tried this, but it took too much time and stopped in the middle, having uploaded only 9% of the data.
  3. Upload the data to an AWS EC2 machine and unzip it there. But can I import the data into SageMaker from EC2?
  4. Many solutions offer a Python script that downloads the data from S3, unzips it locally (on the desktop) and then streams it back to the S3 bucket (see references below; a minimal sketch of such a script follows this list). Since I have the original files I could simply upload them to S3, but this takes too long (see option 2).
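
For reference, here is a minimal sketch of such a script using boto3 and the standard zipfile module. The bucket name and keys are hypothetical, and it assumes each archive and each member fits in memory:

```python
import io
import zipfile
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dataset-bucket"                    # hypothetical bucket name

def restream_zip(zip_key: str, out_prefix: str) -> None:
    """Download one zip archive from S3 and stream its members back uncompressed."""
    buf = io.BytesIO()
    s3.download_fileobj(BUCKET, zip_key, buf)   # pull the whole archive into memory
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        for member in zf.namelist():
            if member.endswith("/"):            # skip directory entries
                continue
            # Assumes each member fits in memory; spill to disk for huge files.
            s3.put_object(Bucket=BUCKET,
                          Key=f"{out_prefix}/{member}",
                          Body=zf.read(member))

restream_zip("zipped/part01.zip", "unzipped")   # hypothetical keys
```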

Added in Edit: I am now trying to upload the uncompressed data using AWS CLI V2.

References:

CodePudding user response:

The most common strategy, which is also the least expensive (since attached storage has its own per-GB cost), is not to use the disk space of the EC2 instance running the training job, but rather to take advantage of the high transfer rate from the bucket into instance memory.

This assumes the bucket resides in the same region as the EC2 instance. Otherwise you would have to pay extra to improve transfer performance.

You can implement your own strategies for reading files in parallel or in chunks in your script, but my advice is to use frameworks that automate this, such as dask, pyspark or pyarrow (if you need to read dataframes), or to reconsider how these zip archives are stored and whether they can be converted into a more convenient format (e.g., a CSV converted to gzip-compressed Parquet). If the data is of a different nature (e.g., images or other), an appropriate lazy data-loading strategy must be identified.
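
As an example of the format conversion mentioned above, here is a minimal pyarrow sketch that reads a CSV straight from S3 and writes it back as gzip-compressed Parquet. The bucket name, keys and region are placeholders; dask or pyspark could do the same at larger scale:

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq
from pyarrow import fs

# Hypothetical bucket/keys and region -- adjust to your layout.
s3 = fs.S3FileSystem(region="us-east-1")

# Read a CSV straight from S3 into an Arrow table.
with s3.open_input_stream("my-dataset-bucket/raw/data.csv") as f:
    table = pv.read_csv(f)

# Write it back as gzip-compressed Parquet, which is smaller and can be
# read lazily and column-wise by dask, pyspark or pyarrow.
pq.write_table(table, "my-dataset-bucket/parquet/data.parquet",
               filesystem=s3, compression="gzip")
```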

For example, for your zip-archive problem, you can easily get the list of files under an S3 folder and read them sequentially, as in the sketch below.
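
A short boto3 sketch of that (bucket and prefix names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dataset-bucket"        # hypothetical bucket name
PREFIX = "zipped/"                  # hypothetical folder/prefix

# List every object under the prefix, page by page.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Read each object sequentially (stream or chunk very large files).
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
        data = body.read()
        print(obj["Key"], len(data), "bytes")
```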

CodePudding user response:

You already have the data in S3 zipped. What's left is:

  1. Provision a SageMaker notebook instance, or an EC2 instance, with enough EBS storage (say 800 GB).
  2. Log in to the notebook instance, open a shell, and copy the data from S3 to the local disk.
  3. Unzip the data.
  4. Copy the unzipped data back to S3 (steps 2–4 are sketched in the code after this list).
  5. Terminate the instance and delete the EBS volume to avoid extra cost.
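
A minimal boto3 sketch of steps 2–4 (the bucket name, prefixes and scratch directory are placeholders; the same can be done from the shell with `aws s3 cp` and `unzip`):

```python
import zipfile
from pathlib import Path

import boto3

s3 = boto3.client("s3")
BUCKET = "my-dataset-bucket"                       # hypothetical bucket name
work = Path("/home/ec2-user/SageMaker/scratch")    # EBS-backed scratch directory
work.mkdir(parents=True, exist_ok=True)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="zipped/"):
    for obj in page.get("Contents", []):
        # Step 2: copy the archive from S3 to the local (EBS) disk.
        archive = work / Path(obj["Key"]).name
        s3.download_file(BUCKET, obj["Key"], str(archive))

        # Step 3: unzip it on the instance.
        out_dir = work / archive.stem
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out_dir)

        # Step 4: copy the uncompressed files back to S3.
        for f in out_dir.rglob("*"):
            if f.is_file():
                s3.upload_file(str(f), BUCKET, f"unzipped/{f.relative_to(work)}")
```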

This should be fast (at least 250 MB/s), as the instance has high bandwidth to S3 within the same AWS Region.

Assuming that by "using the dataset in SageMaker" you mean training, read this guide on the different storage options for large datasets.
