Running custom script as part of Sagemaker Pipelines


Exploring the SageMaker Python SDK and trying to set up a minimal pipeline: a custom Python script that reads two CSV files from an S3 folder, processes the data and writes a single file back to S3. I can't seem to find any examples/documentation that explain this process.

  1. Will having two input files (data/headers are different in each file) create problems?
  2. Which processor should I use? Most examples seem to use SKLearnProcessor or PySparkProcessor. I don't use either framework; I just need to do some simple data processing with pandas.
  3. What is the use/need for ScriptProcessor? Do I need it to run a custom script?
  4. How do I pass on package dependencies? Most dependencies are open-source packages, but a couple are hosted on a private CodeArtifact repository.

Most of the examples found on the web seem to use either SKLearnProcessor or PySparkProcessor.

CodePudding user response:

Your task can be solved with a simple processing job.

By using SKLearnProcessor, you already imply the use of a scikit-learn container. Otherwise, you can use a generic ScriptProcessor and specify the most appropriate container image (either one of the images that already exist in SageMaker, such as the scikit-learn one, or a completely customised one of your own).
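
As a rough sketch of the second option, a ScriptProcessor built on your own image could look like this (the image URI variable and the "python3" command are placeholders/assumptions, not something from your question):

from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(
    image_uri=your_image_uri,   # e.g. an image you pushed to ECR with pandas installed
    command=["python3"],        # how the container should invoke your script
    role=role,
    instance_type="ml.m5.large",
    instance_count=1,
)
# script_processor.run(...) then takes the same code/inputs/outputs
# arguments as the SKLearnProcessor example below.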

Within the scikit-learn container, libraries such as pandas and numpy are already present. You can look at the complete list of requirements here.

Below is an example of code that answers the questions on file input/output from S3:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,  # e.g. "1.0-1"
    role=role,
    instance_type=your_instance_type,  # e.g. 'ml.m5.large'
    base_job_name=your_base_job_name,
    instance_count=your_instance_count,  # e.g. 1
)

sklearn_processor.run(
    code=your_script_path,  # local path to your processing script
    inputs=[
        ProcessingInput(
            input_name='insert-custom-name-for-first-file',
            source=first_file_s3_uri,
            destination="/opt/ml/processing/input/data",
            s3_data_type='S3Prefix',
            s3_input_mode="File"
        ),
        ProcessingInput(
            input_name='insert-custom-name-for-second-file',
            source=second_file_s3_uri,
            # both files land in the same directory inside the container,
            # which is fine as long as their file names differ
            destination="/opt/ml/processing/input/data",
            s3_data_type='S3Prefix',
            s3_input_mode="File"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="output-channel-name",
            destination=output_s3_path_uri,
            # whatever the script writes here is uploaded to the S3 destination
            source="/opt/ml/processing/processed_data"
        )
    ]
)

If both input files reside in the same folder on S3, you can use a single ProcessingInput pointing to that folder instead of listing the two files separately. However, since there are only two, I recommend keeping them as separate, named inputs as in the example.
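
For completeness, the script behind your_script_path could look roughly like the sketch below; the file names and the concat step are placeholders, since your question doesn't say how the two CSVs relate to each other:

import os

import pandas as pd

# these paths match the destination/source values used in the run() call above
input_dir = "/opt/ml/processing/input/data"
output_dir = "/opt/ml/processing/processed_data"

# placeholder file names: use whatever your S3 objects are actually called
df1 = pd.read_csv(os.path.join(input_dir, "first_file.csv"))
df2 = pd.read_csv(os.path.join(input_dir, "second_file.csv"))

# placeholder processing step: combine the frames however your use case requires
result = pd.concat([df1, df2], ignore_index=True)

os.makedirs(output_dir, exist_ok=True)
result.to_csv(os.path.join(output_dir, "processed.csv"), index=False)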


For dependencies: if they are modules to be loaded, you can pass them much like a ProcessingInput. See the run() documentation for the source_dir, dependencies and git_config parameters; that way you can choose the configuration that best fits your task.
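
As a sketch, assuming a recent SageMaker SDK in which SKLearnProcessor accepts those FrameworkProcessor-style run() arguments (the folder layout and names below are hypothetical):

# local layout (hypothetical):
#   processing/
#       preprocess.py       <- your script
#       requirements.txt    <- pip dependencies, typically installed before the script runs
sklearn_processor.run(
    code="preprocess.py",     # entry point, relative to source_dir
    source_dir="processing",  # the whole folder is uploaded with the job
    inputs=your_inputs,       # same ProcessingInput list as above
    outputs=your_outputs,     # same ProcessingOutput list as above
)

For the packages hosted in your private CodeArtifact repository, the container also has to be able to reach that repository at install time, for example by pointing pip at its index URL with an authorisation token; how you provide that token to the job depends on your setup.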


In conclusion, using the scikit-learn container directly is not wrong. You may install a few things you don't need, but it isn't much. If you have no particular compatibility needs with other libraries, use this ready-made container; otherwise, go with a custom container.
