Best way to retrieve K largest elements from large unsorted arrays?

Time:07-19

I recently had a coding test during an interview. I was told:

There is a large unsorted array of one million ints. User wants to retrieve K largest elements. What algorithm would you implement?

During this, I was strongly hinted that I needed to sort the array. So I suggested using the built-in sort(), or maybe a custom implementation if performance really mattered. I was then told that by storing the k largest elements in a collection or array and iterating over the input with a for-loop, it is possible to achieve approximately O(N). In hindsight, I think that approach is actually O(N*k), because each iteration has to scan the K-sized array to find the smallest element to replace, while sorting the whole array would make the code at least O(N log N).

I then reviewed this link on SO, which suggests a priority queue of K numbers, removing the smallest number every time a larger element is found (which would also give O(N log N)): Write a program to find 100 largest numbers out of an array of 1 billion numbers

Is the for-loop method bad? How should I weigh the pros and cons of the for-loop versus the priority-queue and sorting approaches? I'm thinking that if the array is already sorted, it could help by not needing to iterate through the whole array again, i.e. if some other retrieval is later performed on the sorted array, it should be constant time. Is there some performance factor in running the actual code that I didn't consider when theorizing in pseudocode?
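For reference, this is roughly the for-loop approach I had in mind (a quick sketch of my own, assuming the array has at least k elements; not the exact code from the interview):

import java.util.Arrays;

public class NaiveTopK {
    // Keep a k-sized buffer; for each remaining element, replace the buffer's
    // current minimum if the element is larger. This is the O(N*k) approach.
    static int[] topKNaive(int[] arr, int k) {
        int[] top = Arrays.copyOf(arr, k);           // seed with the first k elements (assumes arr.length >= k)
        for (int i = k; i < arr.length; i++) {
            int minIdx = 0;                           // find the smallest of the current top-k: O(k)
            for (int j = 1; j < k; j++) {
                if (top[j] < top[minIdx]) minIdx = j;
            }
            if (arr[i] > top[minIdx]) top[minIdx] = arr[i];
        }
        return top;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                topKNaive(new int[]{3, -1, 3, 12, 7, 8, -5, 9, 27}, 3)));
    }
}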

CodePudding user response:

Another way of solving this is using Quickselect. This should give you a total average time complexity of O(n). Consider this:

  1. Find the kth largest number x using Quickselect (O(n))
  2. Iterate through the array again (or just through the right-side partition) (O(n)) and save all elements ≥ x
  3. Return your saved elements

(If there are repeated elements, you can handle them by keeping count of how many duplicates of x you need to add to the result.)

The difference between your problem and the one in the SO question you linked to is that you have only one million elements, so they can definitely be kept in memory to allow normal use of Quickselect.
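A rough Java sketch of this idea (my own code, using a random-pivot Lomuto partition and assuming 1 <= k <= arr.length) could look like this:

import java.util.Arrays;
import java.util.Random;

public class TopKQuickselect {
    private static final Random RND = new Random();

    // Returns the k largest elements (in no particular order).
    static int[] topK(int[] arr, int k) {
        int[] a = arr.clone();                                   // quickselect reorders the array, so work on a copy
        int x = quickselect(a, 0, a.length - 1, a.length - k);   // k-th largest = (n-k)-th smallest (0-based)

        int[] result = new int[k];
        int idx = 0;
        // Take everything strictly greater than x ...
        for (int v : arr) if (v > x) result[idx++] = v;
        // ... then pad with copies of x to account for duplicates of the threshold value.
        while (idx < k) result[idx++] = x;
        return result;
    }

    // Classic quickselect: returns the value that would sit at position `target`
    // (0-based) if the array were sorted ascending. Average O(n).
    static int quickselect(int[] a, int lo, int hi, int target) {
        while (true) {
            if (lo == hi) return a[lo];
            int pivotIdx = lo + RND.nextInt(hi - lo + 1);
            pivotIdx = partition(a, lo, hi, pivotIdx);
            if (target == pivotIdx) return a[target];
            else if (target < pivotIdx) hi = pivotIdx - 1;
            else lo = pivotIdx + 1;
        }
    }

    // Lomuto partition around a[pivotIdx]; returns the pivot's final index.
    static int partition(int[] a, int lo, int hi, int pivotIdx) {
        int pivot = a[pivotIdx];
        swap(a, pivotIdx, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) swap(a, i, store++);
        }
        swap(a, store, hi);
        return store;
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(topK(new int[]{3, -1, 3, 12, 7, 8, -5, 9, 27}, 3)));
    }
}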

CodePudding user response:

There is a large unsorted array of one million ints. User wants to retrieve K largest elements.

During this, I was strongly hinted that I needed to sort the array.

That wasn't really a hint, I guess, but rather a sort of trick to mislead you (to test how solid your knowledge is).

If you choose to approach the problem by sorting the whole source array, you can't obtain time complexity better than O(n log n).

Instead, we can maintain a PriorityQueue that holds the result. While iterating over the source array, for each element we check whether the queue has reached size K. If it hasn't, the element is simply added to the queue. Otherwise (the size equals K), we compare the element against the lowest element in the queue: if it is smaller or equal, we ignore it; if it is greater, the lowest element is removed and the new element is added.

The time complexity of this approach is O(n log k), because adding a new element into a PriorityQueue of size k costs O(log k), and in the worst case this operation is performed for each of the n elements of the array.

So in terms of Big O, the difference between sorting and using a PriorityQueue boils down to the difference between O(n log n) and O(n log k). When k is much smaller than n, this approach gives a significant performance gain.

Here's an implementation:

public static int[] getHighestK(int[] arr, int k) {
    Queue<Integer> queue = new PriorityQueue<>(); // min-heap: the smallest retained element is at the head
    
    for (int next: arr) {
        // if the queue is full and the new element beats the current minimum, evict the minimum
        if (queue.size() == k && queue.peek() < next) queue.remove();
        // add the element while there is room (either initially, or right after the eviction above)
        if (queue.size() < k) queue.add(next);
    }
    
    return toIntArray(queue);
}

public static int[] toIntArray(Collection<Integer> source) {
    return source.stream().mapToInt(Integer::intValue).toArray();
}

main()

public static void main(String[] args) {
    System.out.println(Arrays.toString(getHighestK(new int[]{3, -1, 3, 12, 7, 8, -5, 9, 27}, 3)));
}

Output:

[9, 12, 27]

CodePudding user response:

This is just an idea. If it needs to be really fast, I would think about a counting approach: create an int array large enough to use every possible value as an index (this could be on the order of a billion entries), then for every number in the original array increment the counter at that number's index. At the end of this loop you have something like [1, 0, 2, 0, 3] (the counts array), which represents the numbers [0, 2, 2, 4, 4, 4] (the initial array). To find the 100 biggest elements, you iterate over the counts array backwards and count down from 100 every time you hit a non-zero entry; if the entry is, for example, 2, that value has to be counted 2 times. The limitation of this approach is that it only works with integers in a bounded range, because the values themselves are used as array indices.
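A rough Java sketch of this counting idea, assuming all values are non-negative and bounded by a known maxValue (the class and method names here are mine):

import java.util.Arrays;

public class TopKCounting {
    // Counting approach: only works when values are integers in a known range [0, maxValue].
    static int[] topK(int[] arr, int k, int maxValue) {
        int[] counts = new int[maxValue + 1];
        for (int v : arr) counts[v]++;                // counts[v] = how many times v occurs

        int[] result = new int[k];
        int idx = 0;
        // Walk the counts backwards (largest value first) until k elements are collected.
        for (int v = maxValue; v >= 0 && idx < k; v--) {
            for (int c = counts[v]; c > 0 && idx < k; c--) {
                result[idx++] = v;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // counts array for this input is [1, 0, 2, 0, 3], as in the example above
        System.out.println(Arrays.toString(topK(new int[]{0, 2, 2, 4, 4, 4}, 3, 4)));
    }
}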

CodePudding user response:

I think you misunderstood what you needed to sort.

You need to keep the K-sized list sorted; you don't need to sort the original N-sized input array. That way the time complexity is O(N * log(K)).

The requirements said that N is very large and K is much smaller, so O(N * log(K)) is also much smaller than O(N * log(N)).

For the K-sized list, you can take a look at Is there a PriorityQueue implementation with fixed capacity and custom comparator?, which uses a PriorityQueue with some additional logic around it.
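As a rough illustration of that "additional logic" (my own sketch, not the code from the linked answer), a minimal fixed-capacity min-queue that keeps only the k largest values offered to it could look like this:

import java.util.PriorityQueue;
import java.util.Queue;

// Minimal "fixed capacity" wrapper around a min-oriented PriorityQueue.
// Assumes capacity > 0.
class BoundedTopK {
    private final int capacity;
    private final Queue<Integer> queue = new PriorityQueue<>(); // smallest retained element at the head

    BoundedTopK(int capacity) { this.capacity = capacity; }

    void offer(int value) {
        if (queue.size() < capacity) {
            queue.add(value);          // still room: just add, O(log k)
        } else if (queue.peek() < value) {
            queue.poll();              // evict the current smallest
            queue.add(value);          // and keep the larger value instead
        }
    }

    Queue<Integer> current() { return queue; }
}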
