Using a C# benchmark I get the following mean times for my insertion sort. The code seems to behave linearly, so what's going on? Is my code wrong, is this what should be expected, or am I misunderstanding big-O notation?
n = 1: 32.84 ns
n = 10: 362.14 ns
n = 100: 3,527.50 ns
n = 1000: 30,895.43 ns
n = 10000: 309,971.64 ns
using System;
using BenchmarkDotNet.Attributes;

public static class RandomUtils
{
    public static long[] generateArray(int count)
    {
        Random random = new Random();
        long[] values = new long[count];
        for (long i = 0; i < count; i++)
            values[i] = random.Next();
        return values;
    }
}
public static class sort_insertion
{
    public static void insertsort(long[] data, long n)
    {
        long i, j;
        for (i = 1; i < n; i++)
        {
            long item = data[i];
            long ins = 0;
            for (j = i - 1; j >= 0 && ins != 1; )
            {
                if (item < data[j])
                {
                    data[j + 1] = data[j];
                    j--;
                    data[j + 1] = item;
                }
                else ins = 1;
            }
        }
    }
}
public class BE
{
    public long A { get; set; }
    public long[] arra;

    public BE()
    {
        arra = RandomUtils.generateArray(100000); // 1, 10, 100, ...
    }

    [Benchmark]
    public void Benchmarka() => sort_insertion.insertsort(arra, arra.Length);
}
CodePudding user response:
Doesn't [Benchmark] call that function repeatedly? But you only initialized the array once in the constructor, so all the later calls are timing the already-sorted case, which is O(N) for Insertion Sort. (One of its good properties is being fast for sorted or almost-sorted input: item < data[j] is always false, so no copying happens and each outer-loop iteration does the minimum amount of work.)
BenchmarkDotNet averages over many runs of the function, so the majority of the measured time comes from the linear, already-sorted case.
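For example, one minimal way to see the quadratic behaviour again (a sketch with a hypothetical class name; pick whatever size you are testing): keep a pristine copy of the random data and re-copy it into a scratch array at the start of every invocation, so the sort always sees unsorted input. The copy time is included in the measurement; the suggestion below avoids even that.

public class BE_FreshData   // hypothetical name, not from the question
{
    private long[] pristine;   // never modified
    private long[] scratch;    // sorted on every invocation

    [GlobalSetup]
    public void Setup()
    {
        pristine = RandomUtils.generateArray(1000); // whatever size you want to test
        scratch = new long[pristine.Length];
    }

    [Benchmark]
    public void SortFreshCopy()
    {
        Array.Copy(pristine, scratch, pristine.Length); // copy cost is measured too
        sort_insertion.insertsort(scratch, scratch.Length);
    }
}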
You could change your InsertionSort to read from one array and insert into another array, so you can sort the same input repeatedly without destroying it, and thus without including any time to memcpy it to an array that you sort in-place. You still do data[j + 1] = data[j] to make room and eventually store item into the gap; it's just a matter of where you get the item to insert: from a separate array instead of from just beyond the sorted region of this array.
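A sketch of that idea (hypothetical helper name, not your exact code): dst holds the sorted output, and each item to insert is read from the untouched src array.

public static class sort_insertion_copy   // hypothetical: reads src, writes dst
{
    public static void insertsort(long[] src, long[] dst, long n)
    {
        for (long i = 0; i < n; i++)
        {
            long item = src[i];          // item comes from the untouched source array
            long j = i - 1;
            while (j >= 0 && item < dst[j])
            {
                dst[j + 1] = dst[j];     // shift larger elements right to make room
                j--;
            }
            dst[j + 1] = item;
        }
    }
}

The benchmark could then call sort_insertion_copy.insertsort(arra, scratch, arra.Length) on every invocation without ever modifying arra.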
Sorting the same pattern repeatedly will let your CPU's branch predictor "learn" the pattern for small enough n, making it blazing fast. For example, my i7-6700K Skylake could learn the branching pattern of a hand-written asm Bubble Sort for something like 12 or 14 integer elements (branching for each swap), with perf stat showing a branch-miss percentage below 1%, IIRC. (In that case I copied fresh data to sort with a few 32-byte AVX vmovdqa copy instructions, which is easy in hand-written asm if you're playing around that way for fun and curiosity.)
But even with good or bad branch prediction, it'll still have to do O(N^2) work, and beyond a dozen or so elements, the CPU won't be able to correctly predict all the inner-loop exits anyway. But you might be able to observe a fall-off in the constant factor.
To get a consistent amount of work without randomness of the input adding noise, you could sort an array of decreasing elements, i.e. sorted in reverse order. (That might make the branch-prediction patterns simpler.)
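For instance, a tiny hypothetical helper that builds such an input:

public static long[] generateDescending(int count)   // hypothetical helper
{
    long[] values = new long[count];
    for (int i = 0; i < count; i++)
        values[i] = count - i;   // count, count-1, ..., 1: reverse-sorted, worst case
    return values;
}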
Note that sizes of 10 and 100 are very far from infinity, so constant factors and practical considerations play a major role.
You shouldn't be getting cache-size effects: C# long is always 64-bit, so even n=1000 only takes 8 KiB of RAM, plenty smaller than L1d cache size. Branch prediction might do better in big arrays, since you'd tend to have longer runs of branching the same way to stay in the inner loop. But it seems unlikely that that alone would come so close to balancing out the quadratic increase in the amount of work.
The 1-element case is presumably just timing overhead; 32 nanoseconds is about 120 clock cycles on a 4 GHz CPU. It doesn't take that long to evaluate the i < n branch condition for i=1 and skip the rest of the function.
You only need to do data[j + 1] = item; the last time; it can go in the else block (which should then just be able to break out of the loop).
But that should just increase the constant factor of your O(N^2) Insertion Sort. I think this implementation looks correct, although using the ins flag seems clunky to me compared to a break. Or make the item < data[j] comparison part of the inner-loop condition, so you just assign data[j + 1] = item once after the loop. (That potentially copies an item to itself, but that's fine.)
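Roughly like this, a sketch of that last variant keeping your long[]/long signature:

public static void insertsort(long[] data, long n)
{
    for (long i = 1; i < n; i++)
    {
        long item = data[i];
        long j = i - 1;
        while (j >= 0 && item < data[j])   // comparison is now the loop condition
        {
            data[j + 1] = data[j];         // shift right to make room
            j--;
        }
        data[j + 1] = item;                // single store after the loop; may copy item onto itself
    }
}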