Bucket sort or merge sort?


I am doing a C assignment where I have to sort data (n = 400) consisting of student scores from 0-100. I am unsure whether to use bucket sort, which distributes the items into buckets, or merge sort, which divides and conquers. Which one should I use and why?

CodePudding user response:

Bucket sort performs poorly when the data set is not well distributed, since most of the items will fall into a few popular buckets. In your case, it's reasonable to assume that most of the student scores will cluster more or less around the median score, with only a few outliers. Therefore I would argue that merge sort performs better in this context, since it is not affected by the distribution of the data set.
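
For reference, a top-down merge sort over the records could look something like the sketch below. This is just a minimal illustration of the divide-and-conquer approach, not part of the assignment; the student struct with a score field is a stand-in for whatever record type you actually have:

#include <cstddef>
#include <string>
#include <vector>

struct student
{
    int score;
    std::string name;
};

// Merge the sorted halves [lo, mid) and [mid, hi) of a, using tmp as scratch.
static void merge(std::vector<student> &a, std::vector<student> &tmp,
                  std::size_t lo, std::size_t mid, std::size_t hi)
{
    std::size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (a[i].score <= a[j].score) ? a[i++] : a[j++];
    while (i < mid)
        tmp[k++] = a[i++];
    while (j < hi)
        tmp[k++] = a[j++];
    for (std::size_t t = lo; t < hi; t++)
        a[t] = tmp[t];
}

// Divide and conquer: sort each half, then merge them. This is O(n log n)
// no matter how the scores are distributed.
static void merge_sort(std::vector<student> &a, std::vector<student> &tmp,
                       std::size_t lo, std::size_t hi)
{
    if (hi - lo < 2)
        return;
    std::size_t mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);
    merge_sort(a, tmp, mid, hi);
    merge(a, tmp, lo, mid, hi);
}

void merge_sort(std::vector<student> &a)
{
    std::vector<student> tmp(a.size());
    merge_sort(a, tmp, 0, a.size());
}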

Additional Consideration

There could be an argument that bucket sort is better if we can adjust the bucket ranges according to the expected distribution of the data set. Sure, if we hit the jackpot and predict the distribution really well, it can significantly speed up the sorting process. However, the downside is that the sorting performance can plummet when our prediction goes wrong, i.e. when we get an unexpected data set. For example, a test that turns out too easy or too difficult might produce such an "unexpected data set" in the context of this question. In other words, bucket sort has the better best-case time complexity, whereas merge sort has the better worst-case time complexity. Which metric to use for comparing algorithms depends on the needs of each application. In practice, the worst-case time complexity is usually found to be more useful, and I think the same can be said for this specific question. It's also a plus that we don't pay the additional cost of calculating/adjusting the bucket ranges if we go for merge sort.
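
To make the bucket-range idea concrete, the mapping could look something like the sketch below. The boundaries here are entirely made up for illustration; a real choice would have to come from actual knowledge of the expected score distribution:

// Hypothetical uneven bucket boundaries: narrow buckets where we expect
// most scores to land, wider buckets at the tails. Illustrative only.
const int upper_bounds[] = {40, 55, 65, 75, 90}; // exclusive upper bounds
const int n_buckets = sizeof upper_bounds / sizeof upper_bounds[0] + 1;

// Map a score in 0-100 to a bucket index under those boundaries.
int bucket_index(int score)
{
    int b = 0;
    while (b < n_buckets - 1 && score >= upper_bounds[b])
        b++;
    return b;
}

If the prediction is good, each bucket receives roughly the same number of items; if the test turns out too easy, almost everything lands in the top buckets and we are back to sorting one big pile inside a single bucket.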

CodePudding user response:

The answer depends on your data. However, merge sort will run in O(n log n) while bucket sort will run in O(n + b), where b is the number of buckets you have. If scores are from zero to (and including) 100, then b is 101. So the question is whether O(n log n) runs faster than O(n + 101), which is an easy question to answer theoretically, since O(n + 101) = O(n) and clearly O(n) is faster than O(n log n). Even if we did the (admittedly silly) exercise of substituting n = 400 we would get 501 for bucket sort and, with log2(400) = 9 (rounded up), 3600 for merge sort. But that is silly, because big-O notation doesn't work that way. Theoretically, we would just conclude that O(n) is better than O(n log n).

But that is the theoretical answer. In practice, the overhead hidden behind the big-O counts, and then it might not be as simple.

That being said, the overhead in bucket sort is usually smaller than for merge sort. You need to allocate an array for some counts and an array to put the output in, and after that you need to run through the input twice, first for counting and then for sorting. A simple bucket sort could look like this:

#include <iostream>
#include <string>

// Some fake data
struct student
{
    int score;
    std::string name;
};
struct student scores[] = {
    {45, "jack"},
    {12, "jill"},
    {99, "john"},
    {89, "james"}};

void bucket_sort(int n, const struct student in[], struct student out[])
{
    int buckets[101]; // range 0-100 with 100 included
    for (int i = 0; i < 101; i++)
    {
        buckets[i] = 0;
    }

    // count how many scores fall into each bucket
    for (int i = 0; i < n; i++)
    {
        buckets[in[i].score]++;
    }

    // turn the counts into starting offsets for each bucket
    // (an exclusive prefix sum)
    int acc = 0;
    for (int i = 0; i < 101; i++)
    {
        int b = buckets[i];
        buckets[i] = acc;
        acc += b;
    }

    // place each student at its bucket's next free slot
    for (int i = 0; i < n; i++)
    {
        out[buckets[in[i].score]++] = in[i];
    }
}

void print_students(int n, const struct student students[])
{
    for (int i = 0; i < n; i++)
    {
        std::cout << students[i].score << ' ' << students[i].name << std::endl;
    }
    std::cout << std::endl;
}

int main(void)
{

    const int no_students = sizeof scores / sizeof scores[0];
    print_students(no_students, scores);

    struct student sorted[no_students];
    bucket_sort(no_students, scores, sorted);
    print_students(no_students, sorted);

    return 0;
}

(excuse my C++, it's been more than 10 years since I used the language, so the code might look a bit more C-like than it should).

The best way to work out what is faster in practice is, of course, to measure it. Compare std::sort with something like the above, and you should get your answer.
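
For instance, a rough timing harness could look something like the sketch below. This is my own sketch, replacing the main above; it assumes the student struct and the bucket_sort from earlier are in scope, and a single run like this is noisy, so you would want to repeat and average:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    const int n = 400;
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 100);

    // 400 fake records with uniformly random scores
    std::vector<student> data(n), out(n);
    for (auto &s : data)
        s.score = dist(rng);

    auto copy = data;
    auto t0 = std::chrono::steady_clock::now();
    std::sort(copy.begin(), copy.end(),
              [](const student &x, const student &y) { return x.score < y.score; });
    auto t1 = std::chrono::steady_clock::now();

    bucket_sort(n, data.data(), out.data());
    auto t2 = std::chrono::steady_clock::now();

    using ns = std::chrono::nanoseconds;
    std::printf("std::sort:   %lld ns\n",
                (long long)std::chrono::duration_cast<ns>(t1 - t0).count());
    std::printf("bucket_sort: %lld ns\n",
                (long long)std::chrono::duration_cast<ns>(t2 - t1).count());
    return 0;
}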

If it weren't for an assignment, though, I wouldn't recommend experimenting. The built-in std::sort can easily handle 400 elements faster than you need, and there is no need to implement new sorting algorithms for something like that. For an exercise, though, it can be fun to do some measuring and experimenting.
