public static ArrayList<Integer> duplicates(int[] arr) {
    ArrayList<Integer> doubles = new ArrayList<Integer>();
    boolean isEmpty = true;
    for (int i = 0; i < arr.length; i++) {
        for (int j = i + 1; j < arr.length; j++) {
            if (arr[i] == arr[j] && !doubles.contains(arr[i])) {
                doubles.add(arr[i]);
                isEmpty = false;
                break;
            }
        }
    }
    if (isEmpty) doubles.add(-1);
    Collections.sort(doubles);
    return doubles;
}
public static void main(String[] args) {
    System.out.println(duplicates(new int[]{1, 2, 3, 4, 4, 4})); // Return: [4]
}
I made this function in Java which returns the duplicate values of an input int array, or returns a -1 if the input array is empty or contains no duplicates.
It works, but there is probably a way to make it faster. Are there any good practices to make functions more efficient and faster in general?
CodePudding user response:
There are, in broad strokes, 2 completely unrelated performance improvements you can make:
- Reduce algorithmic complexity. This is a highly mathematical concept.
- Improve actual performance characteristics - literally, just make it run faster and/or use less memory (often, 'use less memory' and 'goes faster' go hand in hand).
The first is easy enough, but can be misleading: You can write an algorithm that does the same job in an algorithmically less complex way which nevertheless actually runs slower.
The second is also tricky: your eyeballs and brain cannot do the job. The engineers who write the JVM itself are on record as stating that, in general, they have no idea how fast any given piece of code actually runs. That's because the JVM is far too complicated: it has many avenues for optimizing how fast code runs, complicated not just in the code that powers them but also in how they behave. For example, HotSpot kicks in eventually and uses the characteristics of previous runs to decide how best to rewrite a given method into finely tuned machine code, and the hardware you run it on also matters rather a lot.
This leads to the following easy conclusions:
- Don't do anything unless there is an actual performance issue.
- You really want a profiler report that actually indicates which code is 'relevant'. Generally, for any given java app, literally 1% of all of your lines of code is responsible for 99% of the load. There is just no point at all optimizing anything except that 1%. A profiler report is useful in finding the 1% that requires attention. Java ships with a profiler, and there are commercial offerings as well.
- If you want to micro-benchmark (time a specific slice of code against specific inputs), that's really difficult too, with many pitfalls. There's really only one way to do it right: Use the Java Microbenchmark Harness.
- Whilst you can decide to focus on algorithmic complexity, you may still want a profiler report or JMH run because algorithmic complexity is all about 'Eventually, i.e. with large enough inputs, the algorithmic complexity overcomes any other performance aspect'. The trick is: Are your inputs large enough to hit that 'eventually' space?
For this specific algorithm, given that I have no idea what reasonable inputs might be, you're going to have to do the work of setting up JMH and/or profiler runs. However, as far as algorithmic complexity goes:
That doubles.contains call has O(N) algorithmic complexity: the time that call takes grows linearly with the size of the doubles list, which itself can grow with your input.
You can get O(1) algorithmic complexity if you use a HashSet instead.
From the point of view of just plain performance, the performance and memory overhead of an ArrayList<Integer> compared to an int[] is generally quite large.
This gives 2 obvious alternative strategies to optimize this code:
- Replace the ArrayList<Integer> with an int[] (see the sketch after this list).
- Replace the ArrayList<Integer> with a HashSet<Integer> instead.
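As an illustration of the first strategy, here is a rough sketch (the method name is my own, it assumes java.util.Arrays is imported, and it deliberately keeps the nested-loop approach: this only removes the boxing and ArrayList overhead, it does not change the algorithmic complexity):
public static int[] duplicatesPrimitive(int[] arr) {
    // Distinct duplicated values can never outnumber the input, so a
    // fixed-size int[] plus a counter replaces the ArrayList<Integer>.
    int[] doubles = new int[arr.length];
    int count = 0;
    outer:
    for (int i = 0; i < arr.length; i++) {
        // Skip values already recorded (this replaces doubles.contains).
        for (int c = 0; c < count; c++) {
            if (doubles[c] == arr[i]) {
                continue outer;
            }
        }
        for (int j = i + 1; j < arr.length; j++) {
            if (arr[i] == arr[j]) {
                doubles[count++] = arr[i];
                break;
            }
        }
    }
    if (count == 0) {
        return new int[] { -1 };   // keep the "-1 means no duplicates" contract
    }
    int[] result = Arrays.copyOf(doubles, count);
    Arrays.sort(result);
    return result;
}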
You can't really combine these two, not without spending a heck of a long time handrolling a primitive int array backed hashbucket implementation. Fortunately, someone did the work for you: Eclipse Collections has a primitive int hashset implementation.
Theoretically it's hard to imagine how replacing this with IntHashSet can be slower. However, I can't go on record and promise you that it'll be any faster: I can imagine if your input is an int array with a few million ints in there, IntHashSet is probably going to be many orders of magnitude faster. But you really need test data and a profiler report and/or a JMH run or we're all just guessing, which is a bad idea, given that the JVM is such a complex beast.
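For illustration, a minimal sketch of the IntHashSet route (assuming Eclipse Collections is on the classpath and org.eclipse.collections.impl.set.mutable.primitive.IntHashSet plus java.util.Arrays are imported; the method name is mine, and only the basic contains/add/isEmpty/toArray calls are used):
public static int[] duplicatesIntHashSet(int[] arr) {
    // The IntHashSet replaces the ArrayList<Integer>: contains and add are
    // roughly O(1), and there is no Integer boxing at all.
    IntHashSet doubles = new IntHashSet();
    for (int i = 0; i < arr.length; i++) {
        for (int j = i + 1; j < arr.length; j++) {
            if (arr[i] == arr[j] && !doubles.contains(arr[i])) {
                doubles.add(arr[i]);
                break;
            }
        }
    }
    if (doubles.isEmpty()) {
        return new int[] { -1 };
    }
    int[] result = doubles.toArray();   // primitive int[] straight out of the set
    Arrays.sort(result);
    return result;
}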
So, if you're serious about optimizing this:
- Write a bunch of test cases.
- Write a wrapper around this code so you can run those tests in a JMH setup (see the sketch after this list).
- Replace the code with IntHashSet and compare that vs. the above in your JMH harness.
- If that really improves things and the performance now fits your needs, great. You're done.
- If not, you may have to re-evaluate where and how you use this code, or if there's anything else you can do to optimize things.
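A minimal, untested JMH sketch of what that wrapper could look like (the class and method names are my own assumptions; it presumes the jmh-core and jmh-generator-annprocess dependencies are configured, and that the original method and an IntHashSet-based variant live in a Duplicates class so they can be compared):
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class DuplicatesBenchmark {

    // Input sizes to sweep; adjust to whatever is realistic for your use case.
    @Param({"100", "10000", "1000000"})
    int size;

    int[] input;

    @Setup
    public void setUp() {
        // Fixed seed so every fork/iteration sees the same data.
        input = new Random(42).ints(size, 1, 1001).toArray();
    }

    @Benchmark
    public Object nestedLoopArrayList() {
        return Duplicates.duplicates(input);            // the original method
    }

    @Benchmark
    public Object nestedLoopIntHashSet() {
        return Duplicates.duplicatesIntHashSet(input);  // the IntHashSet variant
    }
}
Returning the result from each benchmark method keeps the JIT from eliminating the work as dead code.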
CodePudding user response:
It works, but there is probably a way to make it faster.
I think you will find this approach significantly faster. I omitted the sort from both methods just to check. This does not discuss general optimizations as rzwitserloot's excellent answer already does that.
The two main problems with your method are:
- you are using a nested loop, which is essentially an O(N*N) problem.
- you use contains on a list, which must do a linear search each time to find the value.
A better way is to use a HashSet, which has a lookup time very close to O(1) (relatively speaking, and depending on the set threshold values).
The idea is as follows.
- Create two sets, one for the result and one for what's been seen.
- Iterate over the array:
  - Try to add the value to the seen set; if that returns true, the value was not already in the seen set, so it is not a duplicate and is ignored.
  - If it returns false, the value does already exist in the seen set, so it is added to the duplicates set.
  - Note the use of the bang (!) to invert the above conditions.
- Once the loop is finished, return the duplicates in a list as required.
public static List<Integer> duplicatesSet(int[] arr) {
    Set<Integer> seen = new HashSet<>();
    Set<Integer> duplicates = new HashSet<>();
    for (int v : arr) {
        if (!seen.add(v)) {
            duplicates.add(v);
        }
    }
    return duplicates.isEmpty()
            ? new ArrayList<>(List.of(-1))
            : new ArrayList<>(duplicates);
}
The sort is easily added back in. That will take additional computing time but that was not the real problem.
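If you need it, a sketch of one way to put it back is to replace the final return with something like this (assuming java.util.Collections is imported):
if (duplicates.isEmpty()) {
    return new ArrayList<>(List.of(-1));
}
List<Integer> result = new ArrayList<>(duplicates);
Collections.sort(result);   // restore the sorted output of the original method
return result;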
To test this I generated a list of random values and put them in an array. The following generates an array of 1,000,000 ints between 1 and 1000 inclusive.
Random r = new Random();
int[] val = r.ints(1_000_000, 1, 1001).toArray();
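Then something along these lines exercises both methods on that data (a rough sketch only; it assumes both methods sit in the same class, and for trustworthy timings you would still want the JMH approach from the other answer):
List<Integer> fromNestedLoop = duplicates(val);      // the original method (sorted)
List<Integer> fromHashSet = duplicatesSet(val);      // the HashSet-based method (unsorted)
Collections.sort(fromHashSet);                       // sort so the two results are comparable
System.out.println(fromHashSet.size() + " distinct duplicated values found");
System.out.println(fromNestedLoop.equals(fromHashSet)); // should print true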