I have 1 TB of Parquet data in S3 to be loaded with AWS Glue Spark jobs. I am trying to figure out the number of workers needed for this requirement.
As far as I understand, these are the details of the G.1X configuration:
- 1 DPU added for the master node
- 1 DPU reserved for the driver/ApplicationMaster
- Each worker is configured with 1 executor
- Each executor is configured with 10 GB of memory
- Each executor is configured with 8 cores
- 64 GB EBS block per DPU
So if I take 50 workers, 1 would be set aside for the driver and 1 for the master node, leaving me with 48. That gives 48 * 10 = 480 GB of executor memory (since each executor gets 10 GB) and 48 * 64 = 3072 GB ~ 3 TB of disk, which would be used if any data needs to spill.
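Just to sanity-check those numbers, here is a tiny Python sketch that reproduces the arithmetic above (it simply plugs in the G.1X figures quoted earlier, so treat them as assumptions, not official limits):

```python
# Back-of-the-envelope check of the capacity math above.
workers = 50
reserved = 2                      # 1 for the driver + 1 for the master node
executors = workers - reserved    # 48

executor_memory_gb = 10           # per-executor memory quoted for G.1X above
disk_gb_per_worker = 64           # EBS block per DPU quoted above

total_executor_memory_gb = executors * executor_memory_gb   # 480 GB
total_disk_gb = executors * disk_gb_per_worker              # 3072 GB ~ 3 TB

print(total_executor_memory_gb, "GB memory,", total_disk_gb, "GB disk for spill")
```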
So, is this configuration correct? If not, do I need to increase or decrease the number of workers? Any help is much appreciated. Also, if in the future I have lots of collect operations involved, how can I increase the driver memory, which is 16 GB for now?
CodePudding user response:
To start with, there is no direct statistical or mathematical formula to come up with the number of DPUs needed, because it depends on the nature of the problem you are trying to solve, for example:
- Does the job need to finish as fast as possible, i.e. maximum parallelism?
- Will it be a long-lived job or a short-running job?
- Will the job parse a lot of small files (in KBs) or large chunks (in 100s of MB)?
- Cost considerations? Cost is per DPU-hour, so job run duration matters (see the sketch after this list).
- Frequency of the job (every hour or once a day)? This helps you decide, for example, whether a job that takes 35 minutes with a low number of DPUs is acceptable because it finishes in under an hour and saves cost.
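On the cost point, a rough per-run estimate takes a couple of lines of Python (the per-DPU-hour price below is an assumption; check the AWS Glue pricing page for your region):

```python
# Rough cost estimate for a single run; price and runtime are illustrative.
price_per_dpu_hour = 0.44     # assumed USD price per DPU-hour; varies by region
dpus = 50                     # G.1X workers are 1 DPU each
runtime_hours = 35 / 60       # e.g. the 35-minute run mentioned above

# Glue 2.0+ bills per second with a 1-minute minimum, so cost tracks actual runtime.
estimated_cost = dpus * runtime_hours * price_per_dpu_hour
print(f"~${estimated_cost:.2f} per run")
```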
Now to your question: assuming you are using Glue 2.0, in order to estimate the number of DPUs (or workers) needed you should enable job metrics in AWS Glue. They give you the insight needed to understand job execution time, active executors, completed stages, and the maximum needed executors, so you can scale your AWS Glue job in or out. Using these metrics you can visualize and determine the optimal number of DPUs for your situation.
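For reference, metrics can be enabled from the job settings in the console, or by passing the `--enable-metrics` special parameter when you start a run. A minimal boto3 sketch (job name and worker count are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Start a run with profiling metrics enabled; '--enable-metrics' needs only the key.
glue.start_job_run(
    JobName="my-parquet-load-job",       # hypothetical job name
    WorkerType="G.1X",
    NumberOfWorkers=50,
    Arguments={"--enable-metrics": ""},
)
```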
You can run a dummy job, or the actual job once, then use the metrics to determine an optimal number of DPUs from both a cost and a job-completion-time perspective. For example, run with 50 workers, analyze your under-provisioning factor, and then use that factor to scale your current capacity (a rough calculation is sketched below).
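As a rough illustration of that scaling step (my own arithmetic, not an official AWS formula), take the ratio of the maximum needed executors reported by the metrics to the executors that were actually allocated, and multiply your current worker count by it:

```python
import math

def scaled_worker_count(current_workers, max_needed_executors, max_allocated_executors):
    # Under-provisioning factor: how many executors the job wanted vs. what it got.
    factor = max_needed_executors / max_allocated_executors
    return math.ceil(current_workers * factor)

# e.g. 48 executors were allocated but the metrics show the job could have used 96
print(scaled_worker_count(50, 96, 48))   # -> 100
```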
You can read more on this in the AWS documentation and external links.
For your other question about increasing the driver memory, I would suggest reaching out to AWS support, or trying the G.2X worker type, which has 20 GB of driver memory.
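If you do go the G.2X route, one way to switch is to update the job definition with boto3. Note that UpdateJob overwrites the previous definition, so the role and command have to be supplied again; the names, paths, and counts below are placeholders:

```python
import boto3

glue = boto3.client("glue")

glue.update_job(
    JobName="my-parquet-load-job",                      # hypothetical job name
    JobUpdate={
        "Role": "my-glue-job-role",                     # re-supply the existing role
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/job.py",  # existing script path
        },
        "WorkerType": "G.2X",                           # larger workers, more driver memory
        "NumberOfWorkers": 25,                          # example count; tune via the metrics
    },
)
```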