AWS Glue Python job limits the amount of data written to an S3 bucket?


I've created a Glue job to read data from the Glue Data Catalog and save it to an S3 bucket in Parquet format. It works correctly, but the number of items is limited to 20: every time the job is triggered, only 20 items get saved to the bucket, and I would like to save all of them. Maybe I'm missing some additional property in the Python script.

Here is the script (generated by AWS):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "cargoprobe_data", table_name = "dev_scv_completed_executions", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [*field list*], transformation_ctx = "applymapping1")

resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")

dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")

datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://bucketname"}, format = "parquet", transformation_ctx = "datasink4")

job.commit()

CodePudding user response:

This is done automatically in the background; it is called partitioning. Spark writes each partition as its own file, so the records are spread across several objects in the bucket rather than being dropped. If you want everything in a single file, you can repartition by calling

partitioned_df = dropnullfields3.repartition(1)

to collapse your DynamicFrame into one partition, and then write partitioned_df instead of dropnullfields3.
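
For reference, here is a minimal sketch of how the end of the posted script could look with the repartition step added, assuming the rest of the job stays as generated and that s3://bucketname is still the placeholder path from the question:

# Collapse the DynamicFrame into a single partition so the write
# produces one Parquet file instead of many part files.
partitioned_df = dropnullfields3.repartition(1)

# Write the repartitioned frame; the frame argument now points at
# partitioned_df rather than dropnullfields3.
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame = partitioned_df,
    connection_type = "s3",
    connection_options = {"path": "s3://bucketname"},
    format = "parquet",
    transformation_ctx = "datasink4")

job.commit()

Keep in mind that a single partition forces all the data through one executor, which is fine for small tables but can slow down large writes.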
