Given an unsorted binary array, count the number of 1's, where you are only allowed to check if an entire subarray is all zeros


Given an unsorted binary array a, the only allowed operation is all_zeros(a), which returns True iff all of the array's elements are 0.
The complexity of all_zeros(a) is O(len(a)), with a large constant overhead per call.

I would like to find all the indices which contain 1s, in as few calls to all_zeros as possible. A reasonable sub-problem is to assume the number of 1s is much smaller (say, 100x to 1000x) than the number of 0s.


Theoretically, this is simply solved by iterating over the array element-wise and testing all_zeros([element]) for each one.
In practice, the per-call overhead forces us to work in batches as large as possible. We can't assume we know the ratio of 1's in the array, but if some algorithm requires that knowledge, please do share it.
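
For concreteness, here is a minimal sketch of the interface and of that element-wise baseline. The all_zeros body below is just a stand-in for the real expensive primitive, and find_ones_naive is an illustrative name, not part of the question:

    def all_zeros(a):
        # Stand-in for the expensive primitive: cost is O(len(a)) plus a
        # large constant overhead per call. Returns True iff every element is 0.
        return not any(a)

    def find_ones_naive(a):
        # Element-wise baseline: one all_zeros call per element, so the
        # per-call overhead is paid len(a) times. This is what batching avoids.
        return [i for i in range(len(a)) if not all_zeros(a[i:i + 1])]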

I am looking for a conceptual solution, so I am not specifying the ratio between the overhead constant and the per-element cost of all_zeros.

Please note I am looking for an average-case solution, not a worst-case one.
This requires defining a probability distribution over the 1's and 0's, but I am trying to keep the question at a high level: I will not go deeply into those details, while still keeping it answerable.
There may be a best-case solution that always achieves the minimum overhead. If there is one, it will be accepted.

CodePudding user response:

I would check big chunks and only try smaller ones if they are not all zero. Depending on the ratio of 1s and the 'large overhead constant', I would choose a suitable starting size.

Here is the idea of how to check (by example):

The data: (spaces only for readability)

   00001110 00100001 00100000 01000000 00000000 00000000 00000101 01010000
1. xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
   -> both checked intervals are non-zero -> halve them

2. xxxxxxxxxxxxxxxxx XXXXXXXXXXXXXXXXX xxxxxxxxxxxxxxxxx XXXXXXXXXXXXXXXXX
      non-zero           non-zero          zero              non-zero

3. xxxxxxxx XXXXXXXX xxxxxxxx XXXXXXXX                   xxxxxxxx XXXXXXXX
     n-z      n-z      n-z      n-z                        n-z      n-z   

4. xxxxXXXX xxxxXXXX xxxxXXXX xxxxXXXX                   xxxxXXXX xxxxXXXX
   zero n-z n-z n-z  n-z zero n-z zero                   zero n-z n-z zero

5.     xxXX xxXXxxXX xxXX     xxXX                           xxXX xxXX 

...

I hope the idea is clear. But I highly recommend profiling to determine which block size to start with and when to switch to single-element blocks.
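
As a sketch (not the answerer's exact code), the halving scheme could look like this in Python, assuming the all_zeros primitive from the question; start_size is the tuning knob discussed above:

    def find_ones_halving(a, start_size=32):
        # Collect the indices of all 1s by checking fixed-size chunks and
        # recursively halving any chunk that is not all zero.
        ones = []

        def check(lo, hi):
            if all_zeros(a[lo:hi]):      # one call to the expensive primitive
                return
            if hi - lo == 1:
                ones.append(lo)          # a single non-zero element is a 1
                return
            mid = (lo + hi) // 2         # non-zero chunk: split it and recurse
            check(lo, mid)
            check(mid, hi)

        for lo in range(0, len(a), start_size):
            check(lo, min(lo + start_size, len(a)))
        return ones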

CodePudding user response:

If all_zeros(a) returns false for some subarray, then you can binary search within that subarray to find the position of the first 1. This process tells you nothing about any elements following that 1, so you would start again right after it.
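
A minimal sketch of that binary search, assuming all_zeros as defined in the question and a window [lo, hi) already known to contain a 1 (first_one is an illustrative name):

    def first_one(a, lo, hi):
        # Precondition: a[lo:hi] is known to be non-zero.
        # Each iteration halves the window with one all_zeros call,
        # so finding the first 1 costs about log2(hi - lo) calls.
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if all_zeros(a[lo:mid]):
                lo = mid                 # left half clean: first 1 is on the right
            else:
                hi = mid                 # left half dirty: first 1 is on the left
        return lo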

The question is what size to make your initial queries. You will make the fewest total number of queries if the probability of each query returning true is 50%. If your initial query has a 50% chance of finding a 1, then all the queries in the binary search will also have a 50% chance, and the total cost per 1 is log2(L) + 1 queries, if 1s are L slots apart on average. (For example, with 1s about 128 slots apart, L = 128 gives about log2(128) + 1 = 8 queries per 1.)

If L is twice as long as it should be, or half as long as it should be, then the cost goes up by about 1 query per 1, which is a pretty small price to pay when 1s are far apart.

So a pretty good algorithm that doesn't require knowing the frequency of 1s to start with would be:

  1. Set L = 128, say. This is an a priori estimate of the average spacing between 1s.
  2. Check the first L elements. If they are all zero, multiply L by 2 and continue with the rest of the array.
  3. Otherwise, halve L (if it is > 1), binary search to find the position of the first 1, and continue with the rest of the array after that first 1.

The total cost will be log2(L) + some small number of queries per 1, if the 1s are randomly distributed, and I think that's the worst case.
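
Putting the steps together, a sketch of the whole adaptive scan (reusing the first_one search sketched above; the doubling and halving of L follow steps 2 and 3):

    def find_ones_adaptive(a, L=128):
        # Scan left to right with an adaptive query size L.
        ones, pos, n = [], 0, len(a)
        while pos < n:
            hi = min(pos + L, n)
            if all_zeros(a[pos:hi]):     # window clean: 1s rarer than estimated
                pos = hi
                L *= 2
            else:                        # window holds a 1: 1s denser than estimated
                if L > 1:
                    L //= 2
                i = first_one(a, pos, hi)
                ones.append(i)
                pos = i + 1              # restart right after the found 1
        return ones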
