Python big_o seems to return completely incorrect results - what am I doing wrong?


I am comparing the runtimes of different ways of flattening a list of lists using the big_o module, but for the following methods it does not return the results I expect:

  1. This one:

    def itertools_chain_from_iterable(arr): 
        return list(chain.from_iterable(arr))

returns "constant", which can't possibly be true.

  2. This one:
    def merge_extend(self,arr):
        output = []
        for l in arr:
            output.extend(l)
        return output

returns "cubic" (shouldn't it be quadratic at most?), while...

  3. ...this one:
    def merge_w_sum(self,arr): 
        return sum(arr,[])

returns "linear" (I'm quite sure it should be quadratic, see proof here.

  4. Furthermore, the list comprehension one:
    def merge_bucket(self,bucket):
        return [number for n in bucket for number in n]

returns "polynomial", which seems terrifying (would expect linear here as well)

Code used to calculate the complexities:

print('<function name>:', big_o.big_o(<function name>, 
       lambda n:[big_o.datagen.integers(9900,1,9999999) for n in range(50)],
       n_measures=20)[0])

Output:

complexity of itertools_chain_from_iterable: Constant: time = 0.0013 (sec)
complexity of merge_w_sum: Linear: time = 0.46   6.2E-07*n (sec)
complexity of merge_extend: Cubic: time = 0.048   -2.3E-18*n^3 (sec)
complexity of merge_bucket: Polynomial: time = 0.2 * x^-0.019 (sec)

What is it that I'm doing (or understanding) wrong? Many thanks in advance for useful tips!

CodePudding user response:

Your lambda ignores its argument n and instead always produces the same constant-size input. Produce input of size n instead.

(A note: the question originally had 8 functions and 7 of them were judged "constant time". It was then edited to use a larger constant and got different judgements. I guess your computer's speed is somewhat unstable, since constant-size input should still lead to constant runtimes and thus "constant time" judgements. In any case, the input-generating function should be fixed as described above.)

Given that it's two-dimensional, you could for example produce n lists of size 1, one list of size n, or sqrt(n) lists of size sqrt(n). Presumably n lists of size 1 is what you want if your goal is to demonstrate that sum(arr, []) is bad. For example:

lambda n: [[i] for i in range(n)]
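
The other two input shapes mentioned above could be generated along the same lines; these are just illustrative sketches (the helper names are mine):

import math

# one list of size n
one_long_list = lambda n: [list(range(n))]

# roughly sqrt(n) lists of roughly sqrt(n) elements each
square_lists = lambda n: [[0] * math.isqrt(n) for _ in range(math.isqrt(n))]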

A full program:

import big_o
from itertools import chain

def chained(arr):
    return list(chain.from_iterable(arr))

def merge_extend(arr):
    output = []
    for l in arr:
        output.extend(l)
    return output

def merge_w_sum(arr): 
    return sum(arr,[])

def merge_bucket(bucket):
    return [number for n in bucket for number in n]

funcs = [
    (chained, 10**5),
    (merge_extend, 10**5),
    (merge_w_sum, 10**4),
    (merge_bucket, 10**5),
]

for _ in range(3):
    for func, max_n in funcs:
        complexity = big_o.big_o(
            func,
            lambda n: [[0]] * n,  # data generator: n lists with one element each
            max_n=max_n,
            n_timings=10,
        )[0]
        print(
            f'{func.__name__:13}',
            complexity
        )
    print()

Sample results:

chained       Linear: time = 8.2E-05   5.8E-08*n (sec)
merge_extend  Linear: time = -4.2E-06   8.4E-08*n (sec)
merge_w_sum   Quadratic: time = 0.00013   2.4E-09*n^2 (sec)
merge_bucket  Linear: time = 0.00046   8E-08*n (sec)

chained       Logarithmic: time = -0.0075   0.0014*log(n) (sec)
merge_extend  Linear: time = -2E-05   8.5E-08*n (sec)
merge_w_sum   Quadratic: time = 0.00011   2.4E-09*n^2 (sec)
merge_bucket  Linear: time = -4.2E-05   2.6E-07*n (sec)

chained       Linear: time = -1.8E-05   1.6E-07*n (sec)
merge_extend  Logarithmic: time = -0.01   0.0019*log(n) (sec)
merge_w_sum   Quadratic: time = 8.3E-05   2.4E-09*n^2 (sec)
merge_bucket  Linear: time = 7.1E-05   8.3E-08*n (sec)

You can see it gets it right most of the time, but it still sometimes reports logarithmic instead of linear. On a more stable system it might do better. Larger max_n values should help, too, but when I tried them, big_o crashed with a known internal error.

Also note that I used different max_n values for the different functions. Using the same value for all of them doesn't work well: if it's large, the quadratic-time solution takes unbearably long, and if it's small, the linear-time solutions finish so quickly that big_o has trouble telling them apart from logarithmic or linearithmic. There doesn't seem to be a middle value that works for all of them. Ideally, big_o would adjust max_n automatically, but I don't think it supports that.

CodePudding user response:

First, it's important to understand that Big O notation describes the asymptotic growth of functions, i.e. how they "behave" as the input size grows towards infinity.

E.g. if an algorithm is said to run in O(n^2), it means (roughly) that as the input size n grows towards infinity, the execution time is eventually bounded by a constant multiplied by n^2 (a more precise statement of the definition is given just below).
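
In symbols, the standard definition is:

f(n) = O(g(n)) \iff \exists\, c > 0,\; n_0 \;:\; f(n) \le c \cdot g(n) \;\text{for all}\; n \ge n_0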

Therefore the "real" big-O has nothing to do with concrete time measurements (in particular because we cannot measure infinite input) - at least in the theoretical sense.

Libraries like pypi/big-O try to estimate how the processing time grows as the input size grows. But this is only an estimate, and it depends heavily on the properties of the timed function and on the size of the input you actually give it.

In your case, it seems like the input you gave it was too small. This resulted in very short execution times that caused the estimation method to fail. You can try increasing the input size significantly.


Update: the last paragraph (suggesting that the input was too small) was a guess that turned out to be wrong - see @KellyBundy's answer above.
