How to access a file inside sagemaker entrypoint script


I want to know how to access a file or folder from a private S3 bucket inside the script.py entry point of a SageMaker training job. I uploaded the file to S3 using the following code:

import boto3
import sagemaker

boto3_session = boto3.Session(
    region_name='us-east-1',
    aws_access_key_id='xxxxxxxxxxx',
    aws_secret_access_key='xxxxxxxxxxx'
)

sess = sagemaker.Session(boto3_session)
role = sagemaker.session.get_execution_role(sagemaker_session=sess)
inputs = sess.upload_data(path="df.csv", bucket=sess.default_bucket(), key_prefix=prefix)

This is the estimator code:

import sagemaker
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(
    entry_point='script.py',
    instance_type='ml.g4dn.xlarge',
    source_dir='./',
    role=role,
    sagemaker_session=sess,
)

Now, inside the script.py file, I want to access the df.csv file from S3. This is my code inside script.py:

import argparse
import os

import boto3
import pandas as pd
from sagemaker import Session
from sagemaker.s3 import S3Downloader

parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
args, _ = parser.parse_known_args()

# create session
sess = Session(boto3.Session(region_name='us-east-1'))
S3Downloader.download(s3_uri=args.data_dir,
                      local_path='./',
                      sagemaker_session=sess)

df = pd.read_csv('df.csv')

But this gives the following error:

ValueError: Expecting 's3' scheme, got:  in /opt/ml/input/data/training., exit code: 1

I think one way is to pass the secret key and access key, but I am already passing sagemaker_session. How can I use that session inside the script.py file to read my file?

CodePudding user response:

I think this approach is conceptually wrong.

Files used within SageMaker jobs (whether training or otherwise) should be passed in when the machines are initialized. Imagine you have to create a job with 10 machines: do you want each one to download the file from S3 separately, or have it read once and replicated directly to every instance?

In the case of a training job, the data should be passed to fit() (for direct code like yours) or as a TrainingInput in the case of a pipeline.

You can follow this official AWS example: "Train an MNIST model with PyTorch"

However, the important part is simply passing a dictionary of input channels to fit():

pytorch_estimator.fit({'training': s3_input_train})

You can name the channel (in this case 'training') however you want. The S3 path should be the one where your df.csv is stored.
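
A minimal sketch of the launching side, assuming inputs is the s3:// URI returned by sess.upload_data() in your code and the channel is named 'training':

from sagemaker.inputs import TrainingInput

# inputs is the s3:// URI returned by sess.upload_data() above (an assumption)
s3_input_train = TrainingInput(s3_data=inputs, content_type="text/csv")

# 'training' is the channel name, exposed in the container as SM_CHANNEL_TRAINING
pytorch_estimator.fit({'training': s3_input_train})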

Within your script.py, you can then locate df.csv via environment variables (or, better, expose it as an argparse argument whose default is the environment variable). Generic code with this default will suffice:

parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAINING"])

The nomenclature is SM_CHANNEL_ followed by your channel name in upper case. So if you had put "train": s3_path, the variable would have been called SM_CHANNEL_TRAIN.

Then you can read your file directly from the path held in that environment variable; SageMaker copies the channel data from S3 into the container before your script starts, so no S3 credentials or downloads are needed inside script.py.
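
A minimal sketch of the reading side inside script.py, assuming the channel was named 'training' and the uploaded file is df.csv:

import argparse
import os

import pandas as pd

parser = argparse.ArgumentParser()
# SM_CHANNEL_TRAINING points to /opt/ml/input/data/training inside the container
parser.add_argument("--data-dir", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
args, _ = parser.parse_known_args()

# SageMaker has already copied df.csv from S3 into this local directory,
# so there is no need for boto3 credentials or S3Downloader here
df = pd.read_csv(os.path.join(args.data_dir, "df.csv"))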
