How to track the model progress/status when the SageMaker kernel is dead?


While training a model on AWS SageMaker (let us assume training takes 15 hours or more), if our laptop loses its internet connection in between, the kernel on which the training was started will die. But the model continues to train (I confirmed this with the model.save command, and the model did save to the S3 bucket).

I want to know if there is a way to track the status/progress of the model training when the kernel dies in the SageMaker environment.

Note: I know we can create a training job under Training - Training Jobs - Create Training Job. I just wanted to know if there is any other approach to tracking progress when we are not creating a training job.

CodePudding user response:

Could you specify the 'Job Name' of the SageMaker training job? If you have the job name, you can get the status using an API call: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html

Another note: you can set the name of a training job using the 'TrainingJobName' parameter of the training request: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html
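
For example, a minimal sketch with boto3, assuming the job was named 'my-training-job' (a placeholder):

import boto3

sagemaker_client = boto3.client('sagemaker')

# 'my-training-job' is a placeholder; use the TrainingJobName you specified.
response = sagemaker_client.describe_training_job(TrainingJobName='my-training-job')

# Overall status: InProgress | Completed | Failed | Stopping | Stopped
print(response['TrainingJobStatus'])

# Finer-grained phase, e.g. Starting, Downloading, Training, Uploading
print(response['SecondaryStatus'])

This works from any machine with AWS credentials, independently of the notebook kernel that started the job.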

CodePudding user response:

Simply check the status

When you run a training job, logs are automatically collected in CloudWatch within the "/aws/sagemaker/TrainingJobs" log group, under one or more log streams named after your job (one per instance selected).

This already ensures you can track the status of the job even if the kernel dies or if you simply turn off the notebook instance.
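
You can also read those logs programmatically; a minimal sketch with boto3 (the job-name prefix is a placeholder):

import boto3

logs_client = boto3.client('logs')
log_group = '/aws/sagemaker/TrainingJobs'

# Log streams are named after the job, one or more per training instance.
streams = logs_client.describe_log_streams(
    logGroupName=log_group,
    logStreamNamePrefix='my-training-job'  # placeholder job name
)

for stream in streams['logStreams']:
    events = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=stream['logStreamName'],
        startFromHead=True
    )
    for event in events['events']:
        print(event['message'])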

Monitor metrics

For SageMaker's built-in algorithms, no configuration is required, since the monitorable metrics are already defined.

Custom model

For custom models, on the other hand, to get a monitoring graph of your metrics you have to configure the mapping from log output to CloudWatch Metrics yourself, as the official documentation explains in "Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics" and "Define Metrics".

Basically, you just need to add the parameter metric_definitions to your Estimator (or a subclass of it):

metric_definitions=[
   {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
   {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
]

This will capture, from the print/logger output of your training script, the text matched by the regexes you set (which you can of course change to your liking) and create a corresponding metric in CloudWatch Metrics.
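
The training script then only has to print lines in the format those regexes expect; a hypothetical fragment of a custom training script:

# Stand-ins for your real per-epoch metric values
train_err, valid_err = 0.12, 0.15

print(f'Train_error={train_err};')   # matched by the 'train:error' regex
print(f'Valid_error={valid_err};')   # matched by the 'validation:error' regex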


A complete code example from the docs:

import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="your-own-image-uri",  # your custom training image
    role=sagemaker.get_execution_role(),
    sagemaker_session=sagemaker.Session(),
    instance_count=1,
    instance_type='ml.c4.xlarge',
    # Map lines printed by the training script to CloudWatch metrics
    metric_definitions=[
       {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
       {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
    ]
)
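
Once the job is launched, e.g. with a hypothetical call like estimator.fit({'training': 's3://your-bucket/your-training-data'}), training runs server-side, so the metrics and logs above remain available in CloudWatch even if the local kernel dies.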