Can I specify S3 bucket for sagemaker.sklearn.estimator's SKLearn?

I'm following this example notebook to learn SageMaker's processing jobs API: https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.ipynb

I'm trying to modify their code to avoid using the default S3 bucket, namely: s3://sagemaker-<region>-<account_id>/
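
For reference, you can print the bucket the SDK falls back to (a quick check, assuming the sagemaker SDK is installed with working AWS credentials):

import sagemaker

# The bucket SageMaker uses when none is specified (created on first use):
print(sagemaker.Session().default_bucket())
# -> sagemaker-<region>-<account_id>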

For their data processing step with the .run method:

from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)

I was able to modify it to use my own S3 bucket by setting the destination parameters, like this:

sklearn_processor.run(
    code=output_bucket_uri + "preprocessing.py",
    inputs=[ProcessingInput(
        source=input_bucket_uri + "census-income.csv",
        destination=path + "input/",
    )],
    outputs=[
        ProcessingOutput(
            output_name="train_data",
            source=path + "train/",
            destination=output_bucket_uri + "train/",
        ),
        ProcessingOutput(
            output_name="test_data",
            source=path + "test/",
            destination=output_bucket_uri + "test/",
        ),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)
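
For completeness, the snippet above relies on a few variables I defined earlier in my notebook; they look roughly like this (the bucket names here are placeholders for your own):

# Placeholder values -- substitute your own buckets:
input_bucket_uri = "s3://my-input-bucket/"    # holds census-income.csv
output_bucket_uri = "s3://my-output-bucket/"  # receives the script and job outputs
path = "/opt/ml/processing/"                  # local path inside the processing container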

But for the .fit method:

sklearn.fit({"train": preprocessed_training_data})

I have not been able to find a parameter to pass to it so that the output artifacts are saved to an S3 bucket I specify instead of the default bucket s3://sagemaker-<region>-<account_id>/.

CodePudding user response:

You specify the output artifacts' bucket when you create the SKLearn estimator. SKLearn is a subclass of Framework, which is a subclass of EstimatorBase, which has an output_path argument.

Below is a snippet from the SageMaker examples where they use the PyTorch estimator, but it's the same idea:

from sagemaker.pytorch import PyTorch

est = PyTorch(
    entry_point="train.py",
    source_dir="code",  # directory of your training script
    role=role,
    framework_version="1.5.0",
    py_version="py3",
    instance_type=instance_type,
    instance_count=1,
    output_path=output_path,  # S3 URI where training artifacts are written
    hyperparameters={"batch-size": 128, "epochs": 1, "learning-rate": 1e-3, "log-interval": 100},
)

est.fit(...)

Docs: https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html

CodePudding user response:

In case it helps others...

To get .fit() to output to a designated S3 bucket, I ended up configuring the estimator with output_path.

Here's the example code:

from sagemaker.sklearn.estimator import SKLearn

sklearn = SKLearn(
    entry_point="../processor/code/train.py",
    output_path=output_bucket_uri,  # S3 URI for the trained model artifacts
    framework_version="0.20.0",
    instance_type="ml.m5.xlarge",
    role=role,
)
sklearn.fit({"train": preprocessed_training_data})
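
After fit() returns, you can confirm where the artifact landed: the estimator exposes the trained model's S3 URI (the key includes the training job name):

# Trained model artifact, uploaded under output_path:
# s3://<your-bucket>/<training-job-name>/output/model.tar.gz
print(sklearn.model_data)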

Here's the docs relating to the base estimator class: https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html
