I'm following this example notebook to learn SageMaker's processing jobs API: https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.ipynb
I'm trying to modify their code to avoid using the default S3 bucket, namely s3://sagemaker-<region>-<account_id>/.
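For reference, the default bucket is just whatever the SageMaker session's default_bucket() method returns:

import sagemaker

# The bucket SageMaker falls back to when no destination/output_path is given.
print(sagemaker.Session().default_bucket())  # -> "sagemaker-<region>-<account_id>"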
For their data processing step with the .run method:
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)
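Because the outputs have no destination, they land in that default bucket; the notebook then reads the auto-generated S3 URIs back from the job description, roughly like this (this is also where preprocessed_training_data, used further down, comes from):

# Read the auto-generated S3 output locations back from the job description.
preprocessing_job_description = sklearn_processor.jobs[-1].describe()

for output in preprocessing_job_description["ProcessingOutputConfig"]["Outputs"]:
    if output["OutputName"] == "train_data":
        preprocessed_training_data = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "test_data":
        preprocessed_test_data = output["S3Output"]["S3Uri"]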
I was able to modify it to use my own S3 bucket by using the destination parameter, like this:
sklearn_processor.run(
    code=output_bucket_uri + "preprocessing.py",
    inputs=[
        ProcessingInput(
            source=input_bucket_uri + "census-income.csv",
            destination=path + "input/",
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="train_data",
            source=path + "train/",
            destination=output_bucket_uri + "train/",
        ),
        ProcessingOutput(
            output_name="test_data",
            source=path + "test/",
            destination=output_bucket_uri + "test/",
        ),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)
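Here the variables are along these lines (illustrative values, not from the notebook):

input_bucket_uri = "s3://my-input-bucket/"    # hypothetical bucket holding census-income.csv
output_bucket_uri = "s3://my-output-bucket/"  # hypothetical bucket for the script and processed splits
path = "/opt/ml/processing/"                  # local working directory inside the processing container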
But for the .fit method:
sklearn.fit({"train": preprocessed_training_data})
I have not been able to find a parameter I can pass so that the output artifacts are saved to an S3 bucket that I specify instead of the default bucket s3://sagemaker-<region>-<account_id>/.
CodePudding user response:
You specify the output artifacts' bucket when you create the SKLearn estimator. SKLearn is a subclass of Framework, which is a subclass of EstimatorBase, which has an output_path argument.
Below is a snippet from the SageMaker Examples repository where they use the PyTorch estimator, but it's the same idea:
from sagemaker.pytorch import PyTorch

est = PyTorch(
    entry_point="train.py",
    source_dir="code",  # directory of your training script
    role=role,
    framework_version="1.5.0",
    py_version="py3",
    instance_type=instance_type,
    instance_count=1,
    output_path=output_path,  # S3 URI where the model artifacts will be written
    hyperparameters={"batch-size": 128, "epochs": 1, "learning-rate": 1e-3, "log-interval": 100},
)
est.fit(...)
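Once fit() finishes, the trained artifact is written under output_path as <job-name>/output/model.tar.gz, and you can read the exact S3 URI back from the estimator:

# S3 URI of the trained model artifact, rooted at output_path.
print(est.model_data)  # e.g. s3://<your-bucket>/<job-name>/output/model.tar.gz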
Docs: https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html
CodePudding user response:
In case it helps others... To get .fit() to output to a designated S3 bucket, I ended up configuring the estimator with output_path.
Here's the example code:
from sagemaker.sklearn.estimator import SKLearn

sklearn = SKLearn(
    entry_point="../processor/code/train.py",
    output_path=output_bucket_uri,  # artifacts go to your bucket instead of the default
    framework_version="0.20.0",
    instance_type="ml.m5.xlarge",
    role=role,
)
sklearn.fit({"train": preprocessed_training_data})
Here's the docs relating to the base estimator class: https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html