I am running a training job using the SageMaker API. The code for configuring the estimator looks as follows (I shortened the full path names a bit):
s3_input = "s3://sagemaker-studio-****/training-inputs".format(bucket)
s3_images = "s3://sagemaker-studio-****/dataset"
s3_labels = "s3://sagemaker-studio-****/labels"
s3_output = 's3://sagemaker-studio-****/output'.format(bucket)
cfg='{}/input/models/'.format(s3_input)
weights='{}/input/data/weights/'.format(s3_input)
outpath='{}/'.format(s3_output)
images='{}/'.format(s3_images)
labels='{}/'.format(s3_labels)
hyperparameters = {
    "epochs": 1,
    "batch-size": 2
}
inputs = {
    "cfg": TrainingInput(cfg),
    "images": TrainingInput(images),
    "weights": TrainingInput(weights),
    "labels": TrainingInput(labels)
}
estimator = PyTorch(
    entry_point='train.py',
    source_dir='s3://sagemaker-studio-****/input/input.tar.gz',
    image_uri=container,
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    input_mode='File',
    output_path=outpath,
    train_output=outpath,
    base_job_name='visualsearch',
    hyperparameters=hyperparameters,
    framework_version='1.9',
    py_version='py38'
)
estimator.fit(inputs)
Everything runs fine and I get the success message:
Results saved to runs/train/exp
2022-07-08 08:38:35,766 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2022-07-08 08:38:35,766 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.
2022-07-08 08:38:35,767 sagemaker-training-toolkit INFO Reporting training SUCCESS
2022-07-08 08:39:08 Uploading - Uploading generated training model
2022-07-08 08:39:08 Completed - Training job completed
ProfilerReport-1657268881: IssuesFound
Training seconds: 558
Billable seconds: 558
CPU times: user 1.34 s, sys: 146 ms, total: 1.48 s
Wall time: 11min 20s
When I call estimator.model_data, I get a path pointing to a model.tar.gz file: s3://sagemaker-studio-****/output/.../model.tar.gz
SageMaker generated subfolders in the output folder (which in turn contain a lot of JSON files and other artifacts), but the file model.tar.gz is missing. It is nowhere to be found. Is there anything I need to change or add in order to obtain my model?
Any help is much appreciated.
Answer:
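You need to make sure that your training script stores the model output in the right location inside the training container. When the job finishes, SageMaker packs everything stored in the model directory into model.tar.gz and uploads it to the output path. You can read the location of that directory from the SM_MODEL_DIR environment variable of the training job: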
model_dir = os.environ.get("SM_MODEL_DIR")
Normally it is set to /opt/ml/model.
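The log in the question says "Results saved to runs/train/exp", which suggests that train.py writes its results to a local folder rather than to the model directory. As a minimal sketch of one way to handle this, the results folder could be copied into SM_MODEL_DIR at the end of training. The helper name export_artifacts is hypothetical, and the assumption that runs/train/exp holds everything you want in model.tar.gz may not match your script:

```python
import os
import shutil
from pathlib import Path


def export_artifacts(results_dir, model_dir=None):
    """Copy everything under results_dir into the SageMaker model directory.

    export_artifacts is a hypothetical helper, not part of the SageMaker SDK.
    """
    # SM_MODEL_DIR is set by SageMaker inside the training container;
    # /opt/ml/model is used as a fallback when the variable is missing.
    target = Path(model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    target.mkdir(parents=True, exist_ok=True)

    # dirs_exist_ok requires Python 3.8+, which matches py_version='py38'.
    shutil.copytree(results_dir, target, dirs_exist_ok=True)

    # Return the copied files (relative to the model directory) for logging.
    return sorted(
        str(p.relative_to(target)) for p in target.rglob("*") if p.is_file()
    )
```

Calling export_artifacts("runs/train/exp") as the last step of train.py should leave the files where SageMaker looks for them when it builds model.tar.gz. Alternatively, if the training script accepts an output directory option, pointing it at SM_MODEL_DIR directly avoids the copy.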