Does a large Max Degree Of Parallelism cause queuing?


I would like to know whether my understanding is correct that setting a Max Degree Of Parallelism (MDOP) value larger than a machine's available processor count causes the queueing effect I describe below.

Please treat this as a purely asynchronous I/O operation:

A computer has (for example) 16 processors. This means a max of 16 tasks can be worked on at any one time.

If there is a requirement to call 100 HTTP endpoints and the MDOP is set to 100, 100 HTTP request tasks are created at the same time and all run in parallel. The problem is that only 16 will ever be handled at once, meaning the rest are effectively queued and will be handled once a processor frees up, resulting in increased response times. On top of that, the process will be slowed down further by other parts of the system demanding use of the 16 available processors.

Setting the MDOP to half the available processor count (8, for example, on a 16-processor machine) means that 8 HTTP request tasks will be in flight at any one time. The response times of those 8 requests will be minimal because there is no queueing of tasks, as the configured MDOP is well under the machine's available processor resources. Further to this, there are another 8 processors available to handle any other tasks required by the machine.

The main difference is that the overall response time for the 100 calls will be lower with an MDOP of 100, because all 100 tasks are started at the same time, whereas with 8 there are only ever 8 requests in flight at once.
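For reference, this is roughly the pattern I have in mind, written as a minimal Python/asyncio sketch purely for illustration (I am not tied to any particular stack; the endpoints and the simulated latency are placeholders):

```python
import asyncio
import time

MDOP = 8            # the "max degree of parallelism" value in question
TOTAL_CALLS = 100   # e.g. 100 HTTP endpoints to call

async def call_endpoint(i: int, limiter: asyncio.Semaphore) -> None:
    # Only MDOP requests may be past this point at any one time;
    # the rest wait here, which is the queueing effect described above.
    async with limiter:
        # Placeholder for the real HTTP call; a sleep stands in for
        # network latency so the sketch stays self-contained.
        await asyncio.sleep(0.1)

async def main() -> None:
    limiter = asyncio.Semaphore(MDOP)
    start = time.perf_counter()
    await asyncio.gather(*(call_endpoint(i, limiter) for i in range(TOTAL_CALLS)))
    print(f"{TOTAL_CALLS} calls with MDOP={MDOP}: {time.perf_counter() - start:.2f}s")

asyncio.run(main())
```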

CodePudding user response:

The implicit assumptions made in the question are not correct.

IO operations are generally far from saturating a core. Synchronous and asynchronous requests result in different behaviours: the former is not efficient and should not be used. Neither should be limited to the number of available cores; both should be limited by the maximum concurrency of the target device completing the IO operations, assuming the software stack is doing its job correctly.

For synchronous requests, most of the time is spent waiting for the operation to complete. For example, for a network operation, the OS sends the request buffer to the NIC, which sends it asynchronously over the network link. It takes some time to be sure the data has been sent, so the NIC needs to wait a bit before marking the send request as completed, and it sometimes also needs to wait for the link to be ready. During this time the processor can be free and can actually queue new requests to the NIC, not to mention that the response to the request will take a significant time to arrive (during which neither the processor nor the link works for this specific request). When a synchronous operation needs to wait for the target device, the IO scheduler of the OS does a context switch (assuming the user code does a proper passive wait). This enables the processor to start new IO requests from other threads, or to overlap the IO requests with computation when the load is high. If there are not enough threads doing IO operations, then that is the main issue, not the number of cores itself. However, increasing the number of threads is not efficient: it just increases the number of context switches and thread migrations, resulting in significant overheads. Asynchronous operations should be used instead. Regarding the OS stack, it may also cause many context switches, but those are generally scheduled more efficiently by the OS. Moreover, using asynchronous IO operations removes the artificial limit on the number of threads (i.e. the maximum degree of parallelism).
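As a rough illustration of the thread-per-request synchronous approach described above (a sketch only; the blocking call is simulated with a sleep), the threads spend almost all of their time passively waiting, so adding more of them mostly adds context switches rather than useful work:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_request(i: int) -> int:
    # Stand-in for a synchronous IO call (e.g. a blocking socket read).
    # The thread passively waits here; the core is free to run others.
    time.sleep(0.1)
    return i

start = time.perf_counter()
# One thread per request: each extra thread adds scheduling and memory
# overhead without making the IO device complete requests any faster.
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(blocking_request, range(100)))
print(f"100 blocking calls: {time.perf_counter() - start:.2f}s")
```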

For asynchronous operations, one thread can start a lot of IO requests before they are actually completed. Having more cores does not directly mean more requests can be completed in a given fixed time. This is only true if the OS IO stack is truly parallel and if the operations are limited by the OS stack rather than by the concurrency of the target device (which tends to be the case nowadays, for example on SSDs, which are massively parallel). The thing is that modern processors are very fast, so a few threads should theoretically be enough to saturate the queue of most target devices, although in practice not all OS stacks are designed efficiently for modern IO devices.
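A small sketch of that point: a single event-loop thread can keep hundreds of requests in flight at once, and the elapsed time is governed by the IO latency and the device's concurrency rather than by the core count (the IO is again simulated with a sleep):

```python
import asyncio
import time

async def fake_io(latency: float = 0.1) -> None:
    # Stands in for an asynchronous IO request; while it is pending,
    # the single event-loop thread is free to start the other requests.
    await asyncio.sleep(latency)

async def main() -> None:
    start = time.perf_counter()
    # 500 operations in flight on one thread: the total is about 0.1s,
    # not 500 * 0.1s, because the waits all overlap.
    await asyncio.gather(*(fake_io() for _ in range(500)))
    print(f"500 overlapped operations: {time.perf_counter() - start:.2f}s")

asyncio.run(main())
```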

Every software and hardware stack has a maximum degree of parallelism meant to saturate the device and thus mitigate the latency of IO requests. Because IO latency is generally high, IO request queues are large. "Queuing" does not mean much here, since requests are eventually queued anyway. The question is whether they are queued in the OS stack rather than in the device's own queue, that is, whether the degree of parallelism of the software stack (including the OS) is bigger than that of the target device (which may or may not truly process the incoming requests in its request queue in parallel). The answer is generally yes if the target application sends a lot of requests and the OS stack does not provide any mechanism to regulate the amount of incoming requests. That being said, some APIs provide or even guarantee such a mechanism (asynchronous IO ring buffers are a good example).

Put shortly, it depends on the exact target device, the target operating system, the OS API/stack used, as well as the application itself. The system can be seen as a big platform-dependent dataflow where queues are everywhere, so one needs to carefully specify what "MDOP" and "queuing" mean in this context.

CodePudding user response:

You cannot expect anyone to know what you mean by MDOP unless you mention the precise technology in the context of which you are using this term. Microsoft SQL Server has a concept of MDOP, but you are talking about HTTP requests, so you are probably not talking about MSSQL. So, what are you talking about? Anyway, on with the question.

A computer has (for example) 16 processors. This means a max of 16 tasks can be worked on at any one time.

No, it doesn't mean that. It means that the computer can execute 16 CPU instructions simultaneously (if we disregard pipelining, superscalar execution, memory contention, etc.). A "Task" is a very high-level concept which involves all sorts of things besides just executing CPU instructions. For example, it involves waiting for I/O to complete, or even waiting for events to occur, events which might be raised by other tasks.

When a system allows you to set the value of some concept such as a "degree of parallelism", this means that there is no silver bullet for that value, so depending on the application at hand, different values will yield different performance benefits. Knowing your specific usage scenario you can begin with an educated guess for a good value, but the only way to know the optimal value is to try and see how your actual system performs.

Specifically with degree of parallelism, it depends on what your threads are doing. If your threads are mostly computation-bound, then a degree of parallelism close to the number of physical cores will yield the best results. If your threads are mostly I/O-bound, then your degree of parallelism should be orders of magnitude larger than the number of physical cores, and the ideal value depends on how much memory each thread consumes, because with too many threads you might start hitting memory bottlenecks.
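One way to see this in practice is a quick, throwaway measurement like the sketch below (the sleep simulates blocking I/O; the pool sizes are illustrative, not recommendations). For I/O-bound work the larger pools keep winning until memory or the remote service becomes the bottleneck, while for computation-bound work the benefit flattens out around the core count:

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(_: int) -> None:
    time.sleep(0.05)  # stand-in for a blocking I/O call

cores = os.cpu_count() or 1

# Try a few pool sizes around and well above the core count and compare.
for workers in (cores, cores * 4, cores * 16):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_io, range(200)))
    print(f"{workers:4d} workers: {time.perf_counter() - start:.2f}s")
```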

Proof: check how many threads are currently alive on your computer. (Use some built-in system monitor utility, or download one if need be.) Notice that you have thousands of threads running. And yet look at your CPU utilization: it is probably close to zero. That's because virtually all of those thousands of threads are doing nothing most of the time but waiting for stuff to happen, like for you to press a key or for a network packet to arrive.
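For example, a small script along these lines (it assumes the third-party psutil package, which is not part of the standard library) will typically report thousands of threads alongside a near-idle CPU:

```python
# Requires the third-party psutil package: pip install psutil
import psutil

total_threads = 0
for proc in psutil.process_iter():
    try:
        total_threads += proc.num_threads()
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass  # processes can vanish or be off-limits while we iterate

# Sample overall CPU utilisation over one second.
cpu = psutil.cpu_percent(interval=1)
print(f"~{total_threads} threads alive, CPU at {cpu:.0f}%")
```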
