I am using SageMaker for distributed TensorFlow model training and serving. I am trying to get the shape of the pre-processed datasets from the ScriptProcessor so I can provide it to the TensorFlow Environment.
script_processor = ScriptProcessor(command=['python3'],
                                   image_uri=preprocess_img_uri,
                                   role=role,
                                   instance_count=1,
                                   sagemaker_session=sm_session,
                                   instance_type=preprocess_instance_type)
script_processor.run(code=preprocess_script_uri,
                     inputs=[ProcessingInput(
                         source=source_dir + username + '/' + dataset_name,
                         destination='/opt/ml/processing/input')],
                     outputs=[
                         ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
                         ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test")
                     ],
                     arguments=['--filepath', dataset_name, '--labels', 'labels', '--test_size', '0.2', '--shuffle', 'False', '--lookback', '5'])
preprocessing_job_description = script_processor.jobs[-1].describe()
output_config = preprocessing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    if output["OutputName"] == "train_data":
        preprocessed_training_data = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "test_data":
        preprocessed_test_data = output["S3Output"]["S3Uri"]
I would like to get the following data:
pre_processed_train_data_shape = script_processor.train_data_shape?
I am just not sure how to get the value out of the Docker container. I have reviewed the documentation here: https://sagemaker.readthedocs.io/en/stable/api/training/processing.html
CodePudding user response:
There are a couple of options:
Write some data to a plain text file at /opt/ml/output/message, then call DescribeProcessingJob (via Boto3, the AWS CLI, or the API) and retrieve the ExitMessage value
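A minimal sketch of the ExitMessage route. The helper names and the shape tuple are hypothetical; only the /opt/ml/output/message path and the DescribeProcessingJob call come from SageMaker, and ExitMessage is limited to roughly 1 KB, so keep the payload small:

```python
import json

# Inside the processing container: write the shape to the file that
# SageMaker surfaces as ExitMessage when the job finishes.
def write_exit_message(shape, path="/opt/ml/output/message"):
    with open(path, "w") as f:
        f.write(json.dumps({"train_data_shape": list(shape)}))

# Back in the notebook, after the job completes:
def read_train_shape(job_name, sm_client):
    desc = sm_client.describe_processing_job(ProcessingJobName=job_name)
    return json.loads(desc["ExitMessage"])["train_data_shape"]
```

Here sm_client would be boto3.client("sagemaker"), and job_name is the processing job's name.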
Add a new output to your processing job and send data there
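For the extra-output route, the processing script can dump the shapes to a small JSON file in a new output directory. The /opt/ml/processing/metadata path and the shapes.json file name below are my own choice, not a SageMaker convention:

```python
import json
import os

def save_shapes(train_shape, test_shape, out_dir="/opt/ml/processing/metadata"):
    # Anything written here is uploaded to S3, provided the directory is
    # declared as a ProcessingOutput in the run() call.
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "shapes.json"), "w") as f:
        json.dump({"train": list(train_shape), "test": list(test_shape)}, f)
```

You would then add ProcessingOutput(output_name="metadata", source="/opt/ml/processing/metadata") to the outputs list and download shapes.json from that output's S3Uri after the job finishes.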
If your train_data is in CSV, JSON, or Parquet, use AWS Data Wrangler's S3 select_query to query train_data for its number of rows/columns
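A sketch of the S3 Select route with the awswrangler library, assuming the train output is a single CSV with a header row (the object key passed in is hypothetical):

```python
import awswrangler as wr

def count_rows(s3_csv_path):
    # S3 Select runs the aggregation server-side; only one row comes back.
    df = wr.s3.select_query(
        sql="SELECT COUNT(*) FROM s3object",
        path=s3_csv_path,  # e.g. preprocessed_training_data + "/train.csv"
        input_serialization="CSV",
        input_serialization_params={"FileHeaderInfo": "Use"},
    )
    return int(df.iloc[0, 0])
```

This needs live S3 access to run; the column count would have to come from reading the header separately, since S3 Select only returns query results.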