AWS Glue - Combining all S3 JSON files into S3 Parquet files with a size limit

Time:12-22

How can I combine all JSON files from multiple directories into Parquet files with a 100 MB size limit? JSON files should be combined into a single Parquet file until it reaches 100 MB, then a new Parquet file is started to continue. All JSON files have the same fields.

I've tried converting all JSON files into Parquet files with both the source and destination in S3, and that succeeded, but I was unable to find a way to combine multiple JSON files into a single Parquet file.

Example: 20 JSON files of 8 MB each are converted to 2 Parquet files of 96 MB (12 JSON files) and 64 MB (8 JSON files), respectively.
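The expected grouping behaves like a greedy bin-fill: files are accumulated into the current output until adding another would exceed the limit, then a new output is started. A minimal pure-Python sketch of that logic, using the hypothetical 20 × 8 MB input from the example (Glue's groupSize option does the equivalent internally):

```python
MB = 1024 * 1024
LIMIT = 96 * MB  # matches a Glue groupSize of 100663296 bytes

# Hypothetical input: 20 JSON files of 8 MB each
files = [8 * MB] * 20

# Greedy bin-fill: start a new group when the current one would overflow
groups = [[]]
for size in files:
    if sum(groups[-1]) + size > LIMIT:
        groups.append([])
    groups[-1].append(size)

sizes_mb = [sum(g) // MB for g in groups]
print(sizes_mb)  # [96, 64] -- 12 files, then the remaining 8
```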

CodePudding user response:

If you are certain the input files are 8 MB each, you can do the following:

dyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        'paths': ["s3://awsexamplebucket/"],
        'groupFiles': 'inPartition',  # group small input files within each partition
        'groupSize': '100663296'      # target group size in bytes (96 MiB)
    },
    format="json"
)
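The groupSize value is specified in bytes; 100663296 is exactly 96 MiB, which is why the example produces ~96 MB output files while staying under the 100 MB limit. A quick arithmetic check:

```python
group_size = 100663296           # the groupSize string from the Glue options, as an int
limit = 100 * 1024 * 1024        # the 100 MB cap from the question

print(group_size == 96 * 1024 * 1024)  # True: exactly 96 MiB
print(group_size < limit)              # True: safely under 100 MB
```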

That code block will read your source JSON files and group them within each partition into ~96 MB blocks. That means that, if no transformation happens between your input and output, your data will be written back to S3 as ~96 MB Parquet files using the following code:

glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": "s3://s3path",
    },
    format_options={
        "useGlueParquetWriter": True  # Glue's native Parquet writer
    },
)