Time complexity of distributed merge sort

Suppose there are N elements (N very large) and we have to split the data across K machines to do a merge sort. What is the time complexity?

My attempt: since each machine takes N/K elements, the sorting on each machine takes O((N/K) log(N/K)). The K-way merge takes O(N log K). So the total complexity is O((N/K) log(N/K) + N log K).

Could someone please confirm whether the above is correct? And given K << N, can I simplify this further to just O((N/K) log(N/K))?
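For concreteness, the scheme described in the question can be sketched as follows. This is only a single-process simulation under stated assumptions: the K machines are modelled as K list chunks, and the function name `distributed_merge_sort` is made up for illustration.

```python
import heapq

def distributed_merge_sort(data, k):
    """Simulate: split across k machines, sort locally, merge on one machine."""
    # Chunk size is ceil(N / k); max(1, ...) avoids a zero step for empty input.
    chunk = max(1, -(-len(data) // k))
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # Each (simulated) machine sorts its chunk: O((N/K) log(N/K)) per machine.
    sorted_parts = [sorted(p) for p in parts]
    # A single machine performs a heap-based K-way merge: O(N log K).
    return list(heapq.merge(*sorted_parts))
```

In this sketch the local sorts could run in parallel, but the final `heapq.merge` is sequential, which is where the N log K term comes from.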

CodePudding user response：

First, your simplification is not correct. Suppose that K = Θ(sqrt(N)), which still satisfies K << N. Then the O(N log(K)) term is O(N log(N)). But your simplified version says the total should be O(sqrt(N) log(N)), which is clearly incorrect.
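A quick numeric check makes the gap visible. The values N = 2^40 and K = 2^20 (so K = sqrt(N)) are chosen purely for illustration, and `cost_terms` is a hypothetical helper that evaluates the two terms with constant factors ignored.

```python
import math

def cost_terms(n, k):
    """Return the two terms of (N/K) log(N/K) + N log K, ignoring constants."""
    local_sort = (n / k) * math.log2(n / k)  # per-machine sort term
    merge = n * math.log2(k)                 # single-machine K-way merge term
    return local_sort, merge

n = 2**40
k = 2**20  # K = sqrt(N)
local_sort, merge = cost_terms(n, k)
ratio = merge / local_sort  # how much larger the merge term is
```

Here the merge term is larger than the local sort term by a factor of K, so dropping it changes the answer.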

Second, your estimate for a K-way merge is appropriate if all of the merging happens on one machine. But there are ways to distribute the merge. For example, take evenly spaced samples from each machine and sort the combined sample. Use that to choose split points that partition the data into K approximately equal ranges. Send those split points to all of the machines, each of which splits its data and sends each piece to the correct place. Now each machine does a K-way merge on O(N/K) of the data. How much data you want for that sample, and what the overhead is, will depend on how big K is relative to N. But it isn't hard to find K and N such that the elapsed (wall-clock) time to do the work of sorting the data is o(N).
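The sampling idea above can be sketched roughly like this. It is a single-process simulation, not a real distributed implementation: machines are plain lists, "sending" is appending to an inbox, and the name `sample_sort` and the parameter `samples_per_machine` are assumptions made for the example.

```python
import bisect
import heapq

def sample_sort(machines, samples_per_machine=8):
    """Simulate a sample-based distributed sort over a list of per-machine lists."""
    k = len(machines)
    # Step 1: every machine sorts its own data locally.
    local = [sorted(m) for m in machines]
    # Step 2: take evenly spaced samples from each sorted run, then sort them all.
    sample = sorted(
        x for run in local
        for x in run[::max(1, len(run) // samples_per_machine)]
    )
    if not sample:  # no data anywhere
        return [[] for _ in range(k)]
    # Step 3: pick k-1 splitters that cut the sample into k roughly equal ranges.
    splitters = [sample[(i * len(sample)) // k] for i in range(1, k)]
    # Step 4: each machine cuts its run at the splitters and "sends" piece j to machine j.
    inbox = [[] for _ in range(k)]
    for run in local:
        cuts = [0] + [bisect.bisect_right(run, s) for s in splitters] + [len(run)]
        for j in range(k):
            inbox[j].append(run[cuts[j]:cuts[j + 1]])
    # Step 5: each machine K-way merges the sorted pieces it received.
    return [list(heapq.merge(*pieces)) for pieces in inbox]
```

Concatenating the returned lists in machine order gives the fully sorted data. How balanced the partitions are depends on the sample size, which is the trade-off mentioned above.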

But we now get a new bottleneck: dividing the data and reassembling it. If you're doing this on a single machine, that's unavoidably O(N). But if you keep data on a distributed filesystem, sort it, and assemble the answer on another distributed filesystem, you can eliminate all of the bottlenecks! (Operationally, at scale, you want everything distributed all of the time for exactly this reason.)

CodePudding user response：

A merge-sort combines two sorted arrays to produce a sorted array comprised of the elements of the two arrays. I understand you wish to perform partial merge-sorts and then combine the results to produce a sorted array.

For convenience I will assume that the two arrays, a and b, are of the same size, n*m, for some positive integers n and m. Suppose one partitioned each array into n arrays of size m: [a1, a2, ..., an] and [b1, b2, ..., bn].

Now merge-sort each of the n pairs of arrays [ai, bi] to produce sorted arrays c = [c1, c2,..., cn]. That has a computational complexity of O(n*m*log(m)), but that is of no matter, as the heavy-lifting has not yet begun.

We now need to merge-sort the arrays of c. Suppose we first merge-sort c1 and c2 to produce the 2*m-element array d. Then merge-sort c3 into d, resulting in d having 3*m elements, and so on. The complexity of this operation is O(n*(n*m)*log(n*m)), whereas doing a simple merge-sort on a and b is O((n*m)*log(n*m)). This divide-and-conquer approach obviously is not advised.

If instead a and each bi were merge-sorted, that would be O(n*(n*m)*log(n*m)), already making it a loser. (Combining the n sorted arrays of n*m elements would have the same big-O.)