Writing a PySpark DF to S3 faster


I am pulling data from a MySQL DB using PySpark and trying to upload the same data to S3, also with PySpark. While doing so, it takes around 5-7 minutes to upload a chunk of 100K records.

At this rate the data pull will take months, as there are around 3,108,700,000 records in the source. Is there a better way to speed up the S3 upload?

NOTE: The data pull for a single fetch of 100K records takes only 20-30 seconds; it's just the S3 upload that is causing the issue.

Here is how I am writing the DF to S3.

df = (spark.read.format("jdbc")
      .option('url', jdbcURL)
      .option('driver', driver)
      .option('user', user_name)
      .option('password', password)
      .option('query', data_query)
      .load())



output_df = df.persist()
output_df.repartition(1).write.mode("overwrite").parquet(target_directory)

CodePudding user response:

Repartitioning is a good move, as writing a few large files to S3 is better than writing many small ones.

The persist() will slow you down, because you're materializing the whole DataFrame before the write. You are effectively writing the data out twice.

S3 is made for large, slow, inexpensive storage; it's not made to move data quickly. If you want to migrate the database, AWS has tools for that and they're worth looking into, even if it's only so you can then move the resulting files into S3.

S3 assigns and auto-splits its internal partitions based on the object key, and it does so by prefix. Keys that only vary at the tail (/heres/some/variation/at/the/tail1, /heres/some/variation/at/the/tail2) all land on the same partition, and that partition is your bottleneck here. To spread the load across multiple partitions, vary the key at the head of the path (/head1/variation/isfaster/, /head2/variation/isfaster/).
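As a rough illustration of that head-vs-tail idea, the varying component can be moved to the front of the write target. This is only a minimal sketch: the bucket name, shard count, and chunk_id variable are assumptions for illustration, not part of the original code.

# Put the varying component near the head of the S3 key so successive
# chunks spread across different internal partitions.
num_shards = 16
shard = chunk_id % num_shards

# Head of the key varies per chunk: .../shard-00/..., .../shard-01/..., ...
target_directory = f"s3a://my-export-bucket/shard-{shard:02d}/mysql_extract/chunk={chunk_id}"

df.repartition(1).write.mode("overwrite").parquet(target_directory)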

  1. Try removing the persist(). (At least consider cache() as a cheaper alternative.)
  2. Keep the repartition.
  3. Vary the head of the file path so your writes get assigned more partitions.
  4. Consider a redesign that pushes the data into S3 with the REST API's multipart upload (see the sketch after this list).
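For point 4, one rough shape such a redesign could take is to land each parquet chunk on local disk and push it with boto3, whose transfer manager splits large files into parts and uploads them in parallel. This is only a sketch; the local path, bucket name, key, and tuning values are assumptions, not anything from the original post.

import boto3
from boto3.s3.transfer import TransferConfig

# Transfer manager config: files above the threshold are split into 64 MB parts
# and uploaded by up to 10 threads in parallel (multipart upload under the hood).
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/tmp/mysql_extract/chunk_00042.parquet",  # illustrative local path
    Bucket="my-export-bucket",                          # illustrative bucket
    Key="head-42/mysql_extract/chunk_00042.parquet",    # varied head prefix
    Config=config,
)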