I know there are similar questions, but none this specific.
Input: an n-element array of unsorted elements with values from 1 to n-1; exactly one value is duplicated (e.g. n = 5, tab[n] = {3,4,2,4,1}).
Task: find the duplicate with the best possible complexity.
I wrote this algorithm:
#include <iostream>

int main() {
    int tab[] = { 1,6,7,8,9,4,2,2,3,5 };
    int arrSize = sizeof(tab) / sizeof(tab[0]);
    // Mark each value as seen by adding arrSize to the slot it indexes;
    // % arrSize recovers the original value even after a slot was marked.
    for (int i = 0; i < arrSize; i++) {
        tab[tab[i] % arrSize] += arrSize;
    }
    // Only the duplicated value's slot was incremented twice.
    for (int i = 0; i < arrSize; i++) {
        if (tab[i] >= arrSize * 2) {
            std::cout << i;
            break;
        }
    }
}
but I don't think it has the best possible complexity. Do you know a better method/algorithm? I can use any C library, but I don't have any idea.
Is it possible to get better complexity than O(n)?
CodePudding user response:
In terms of big-O notation, you cannot beat O(n) (same as your solution here). But you can get better constants and a simpler algorithm by using the fact that the sum of the elements 1, ..., n-1 is well known: it is n(n-1)/2.
int sum = 0;
for (int x : tab) {
    sum += x;
}
int duplicate = sum - n * (n - 1) / 2;
The constants here will be significantly better, as each array element is accessed exactly once, sequentially, which is much more cache-friendly and efficient on modern architectures. (Note that this solution ignores integer overflow, but it is easy to account for that by giving sum twice as many bits as the array's elements.)
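For instance, here is a minimal runnable sketch of this approach on the question's sample array, using a 64-bit accumulator to sidestep the overflow issue (the main wrapper and variable names are just for illustration):
#include <cstdint>
#include <iostream>

int main() {
    int tab[] = { 1,6,7,8,9,4,2,2,3,5 };
    std::int64_t n = sizeof(tab) / sizeof(tab[0]);  // n = 10
    std::int64_t sum = 0;  // 64-bit sum: no overflow for int-sized elements
    for (int x : tab) {
        sum += x;
    }
    // Elements are 1..n-1 with one duplicate, so the excess over n(n-1)/2
    // is exactly the duplicate: here 47 - 45 = 2.
    std::cout << sum - n * (n - 1) / 2 << "\n";
}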
CodePudding user response:
Adding the classic answer because it was requested. It is based on the idea that if you XOR a number with itself you get 0. So if you XOR all numbers from 1 to n-1 together with all numbers in the array, everything cancels except the duplicate.
int duplicate = arr[0];
for (int i = 1; i < n; i++) {  // n is the number of elements in arr
    duplicate = duplicate ^ arr[i] ^ i;
}
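As a self-contained C++ sketch of the same idea on the question's sample array (the vector and the main wrapper are assumptions for illustration):
#include <iostream>
#include <vector>

int main() {
    std::vector<int> arr{1, 6, 7, 8, 9, 4, 2, 2, 3, 5};
    int duplicate = arr[0];
    for (int i = 1; i < static_cast<int>(arr.size()); i++) {
        // Each value 1..n-1 is cancelled by an equal index; only the duplicate survives.
        duplicate = duplicate ^ arr[i] ^ i;
    }
    std::cout << "duplicate is " << duplicate << "\n";  // prints 2
}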
CodePudding user response:
Don't focus too much on asymptotic complexity. In practice the fastest algorithm is not necessarily the one with the lowest asymptotic complexity, because constants are not taken into account: O(huge_constant * N) == O(N) == O(tiny_constant * N).
You cannot inspect N values in less than O(N). However, you do not always need a full pass through the array: you can stop once you find the duplicate:
#include <iostream>
#include <vector>

int main() {
    std::vector<int> vals{1, 2, 4, 6, 5, 3, 2};
    std::vector<bool> present(vals.size());  // present[v] == true once v has been seen
    for (const auto& e : vals) {
        if (present[e]) {
            std::cout << "duplicate is " << e << "\n";
            break;  // stop early; no need to scan the rest
        }
        present[e] = true;
    }
}
In the "lucky case" the duplicate is at index 2. In the worst case the whole vector has to be scanned. On average it is again O(N)
time complexity. Further it uses O(N)
additional memory while yours is using no additional memory. Again: Complexity alone cannot tell you which algorithm is faster (especially not for a fixed input size).
No matter how hard you try, you won't beat O(N): no matter in what order you traverse the elements (remembering the ones already seen), in the best case the duplicate is among the first two elements you inspect, in the worst case it is the last, and on average it will be O(N).