No space left on device error with pyspark aws glue


I am using AWS Glue to extract DynamoDB items into S3. I read all the items using PySpark and AWS Glue, apply a transformation to the items retrieved from DynamoDB, and write them to S3. But I always run into the error "No space left on device."

The worker type I use is G.1X; each worker maps to 1 DPU (4 vCPUs, 16 GB of memory, 64 GB of disk), and the size of the DynamoDB table is 6 GB.

Based on the AWS documentation, during a shuffle, data is written to disk and transferred across the network, so the shuffle operation is bound by local disk capacity. How can I configure the shuffle programmatically? Please find my sample code below.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Map
from awsglue.transforms import Filter
from pyspark import SparkConf

conf = SparkConf()
glue_context = GlueContext(SparkContext.getOrCreate())



# my_table has 'id' and 'Uri' attributes
resources_table_dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my_table",
        "dynamodb.throughput.read.percent": "0.4",  # use up to 40% of the table's read capacity
        "dynamodb.splits": "8"  # number of parallel segments used to scan the table
    }
)

# Predicate: filter out rows whose IDs are the same
def filter_new_id(dynamicRecord):
    uri = dynamicRecord['Uri']
    uri_split = uri.split(":")
    # Get the internal ID
    internal_id = uri_split[1]
    print(dynamicRecord)  # debug output; shows up in the executor logs

    if internal_id == dynamicRecord['id']:
        return False

    return True


# Keep only the items whose IDs are different.
resource_with_old_id = Filter.apply(
    frame=resources_table_dynamic_frame,
    f=lambda x: filter_new_id(x),
    transformation_ctx='resource_with_old_id'
)

glue_context.write_dynamic_frame_from_options(
    frame=resource_with_old_id,
    connection_type="s3",
    connection_options={"path": "s3://path/"},
    format="json"
)
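
My understanding is that shuffle-related properties can be set on the SparkConf before the SparkContext is created (the conf object above is created but never actually passed in). Something along the lines of the untested sketch below; the property names are from the Spark documentation and the values are placeholders:

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

conf = SparkConf()
# Placeholder values -- not tuned for any particular workload.
conf.set("spark.sql.shuffle.partitions", "200")  # partitions used for DataFrame/SQL shuffles
conf.set("spark.default.parallelism", "200")     # parallelism for RDD shuffles
conf.set("spark.shuffle.compress", "true")       # compress shuffle files written to local disk

# Pass the conf in when the context is created so the settings take effect.
glue_context = GlueContext(SparkContext.getOrCreate(conf=conf))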

CodePudding user response:

I addressed this issue with the following tweak to the code posted in the OP.

resources_table_dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my_table",
        "dynamodb.throughput.read.percent": "0.5",
        "dynamodb.splits": "200"
    },
    additional_options={
        "boundedFiles": "30000"
    }
)

I added boundedFiles as suggested in the AWS docs here and increased dynamodb.splits to make it work for me.
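
Note that bounded execution is generally used together with job bookmarks, so that each run only processes what previous runs have not. A minimal sketch of the usual bookmark boilerplate, assuming the standard JOB_NAME job parameter:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# ... read from DynamoDB and write to S3 as above ...

# Committing advances the bookmark so the next bounded run continues
# where this one stopped.
job.commit()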
