Sagemaker creates output folders but no model.tar.gz after successful completion of the Training Job

Time:07-08

I am running a Training Job using the Sagemaker API. The code for configuring the estimator looks as follows (I shortened the full path names a bit):

s3_input = "s3://sagemaker-studio-****/training-inputs".format(bucket)
s3_images = "s3://sagemaker-studio-****/dataset"
s3_labels = "s3://sagemaker-studio-****/labels"
s3_output = 's3://sagemaker-studio-****/output'.format(bucket)

cfg='{}/input/models/'.format(s3_input)
weights='{}/input/data/weights/'.format(s3_input)
outpath='{}/'.format(s3_output)
images='{}/'.format(s3_images)
labels='{}/'.format(s3_labels)

hyperparameters = {
    "epochs": 1,
    "batch-size": 2
}

inputs = {
    "cfg": TrainingInput(cfg),
    "images": TrainingInput(images),
    "weights": TrainingInput(weights),
    "labels": TrainingInput(labels)
}

estimator = PyTorch(
    entry_point='train.py',
    source_dir='s3://sagemaker-studio-****/input/input.tar.gz',
    image_uri=container,
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    input_mode='File',
    output_path=outpath,
    train_output=outpath,
    base_job_name='visualsearch',
    hyperparameters=hyperparameters,
    framework_version='1.9',
    py_version='py38'
)

estimator.fit(inputs)

Everything runs fine and I get the success message:

Results saved to runs/train/exp
2022-07-08 08:38:35,766 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2022-07-08 08:38:35,766 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 0 from exiting process.
2022-07-08 08:38:35,767 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2022-07-08 08:39:08 Uploading - Uploading generated training model
2022-07-08 08:39:08 Completed - Training job completed
ProfilerReport-1657268881: IssuesFound
Training seconds: 558
Billable seconds: 558
CPU times: user 1.34 s, sys: 146 ms, total: 1.48 s
Wall time: 11min 20s

When I call estimator.model_data I get a path pointing to a model.tar.gz file: s3://sagemaker-studio-****/output/.../model.tar.gz

Sagemaker generated subfolders in the output folder (which in turn contain a lot of JSON files and other artifacts).

But the file model.tar.gz is missing. This file is nowhere to be found. Is there anything I need to change or to add, in order to obtain my model?

Any help is much appreciated.

Answer:

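You need to make sure your training script stores the model output in the right location inside the training container. When the job finishes, Sagemaker packs everything stored in the directory given by SM_MODEL_DIR into model.tar.gz and uploads it to the output path. Your log shows the results being saved to runs/train/exp, which is presumably outside that directory, and that would explain why no model.tar.gz appears. You can read the location from the environment of the training job: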

import os
model_dir = os.environ.get("SM_MODEL_DIR")

Normally it is set to /opt/ml/model.
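
One way to apply this, as a minimal sketch and not necessarily how your train.py is organised: copy the finished results into the model directory at the end of the script. The helper name export_to_model_dir is hypothetical, and the assumption that the artifacts end up in runs/train/exp is taken only from the log line in the question.

```python
import os
import shutil
from pathlib import Path


def export_to_model_dir(results_dir, model_dir=None):
    """Copy everything under results_dir into the SageMaker model directory.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    # SM_MODEL_DIR is set inside the training container. The fallback is the
    # usual default and only matters when the variable is missing.
    target = Path(model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    target.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok needs Python 3.8+, which matches py_version='py38'.
    shutil.copytree(results_dir, target, dirs_exist_ok=True)
    # Return the copied files (relative paths) so the job log shows what was exported.
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )
```

Calling export_to_model_dir("runs/train/exp") as the last step of train.py should then leave the weights where Sagemaker looks for them. If the training code accepts an output directory option, pointing that option at model_dir directly would avoid the copy.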
