Assume that you have chosen the most efficient algorithm for solving a problem where performance is the first priority, and now that you're implementing it you have to decide on details like the following: v[i*3 + 0], v[i*3 + 1] and v[i*3 + 2] contain the components of the velocity of particle i, and we want to calculate the total kinetic energy. Given that all particles are of the same mass, one may write:
inline double sqr(double x)
{
    return x*x;
}
double get_kinetic_energy(double v[], int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += sqr(v[i*3 + 0]) + sqr(v[i*3 + 1]) + sqr(v[i*3 + 2]);
    return 0.5 * mass * sum;  /* mass is assumed to be defined elsewhere */
}
To reduce the number of multiplications, it can be written as:
double get_kinetic_energy(double v[], int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
    {
        double *w = v + i*3;
        sum += sqr(w[0]) + sqr(w[1]) + sqr(w[2]);
    }
    return 0.5 * mass * sum;
}
(one may write a function with even fewer multiplications, but that's not the point of this question)
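(For illustration only, such a variant might advance a pointer through the array instead of indexing; this is a sketch, assuming the same sqr helper and a mass defined elsewhere:)
double get_kinetic_energy(double v[], int n)
{
    double sum = 0.0;
    /* walk a pointer over the array; the only multiplication left
       is in computing the end pointer, outside the loop */
    for (double *w = v, *end = v + 3*n; w != end; w += 3)
        sum += sqr(w[0]) + sqr(w[1]) + sqr(w[2]);
    return 0.5 * mass * sum;
}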
Now my question is: Since many C compilers can do this kind of optimization automatically, where should the developer rely on the compiler, and where should she/he try to do some optimization manually?
CodePudding user response:
where should the developer rely on the compiler and where should she/he try to do some optimization manually?
Do I have fairly in-depth knowledge of the target hardware as well as how C code translates to assembler? If no, forget about manual optimizations.
Are there any obvious bottlenecks in this code - how do I know that it needs optimization in the first place? Obvious culprits are I/O, complex loops, busy-wait loops, naive algorithms etc.
When I found this bottleneck, how exactly did I benchmark it, and am I certain that the problem doesn't lie in the benchmarking method itself? Experience from SO shows that some 9 out of 10 strange performance questions can be explained by incorrect benchmarking, including benchmarking with compiler optimizations disabled...
From there on you can start looking at system-specific things as well as the algorithms themselves - there are far too many things to look at to cover in an SO answer. There's a huge difference between optimizing code for a low-end microcontroller and for a 64-bit desktop PC (and everything in between).
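To make the benchmarking point concrete, a minimal timing harness might look like the sketch below. It repeats the question's sqr and get_kinetic_energy, plus a placeholder mass, so it compiles on its own; build it with optimizations enabled (e.g. gcc -O2), otherwise the measurement says little about release performance:
#include <stdio.h>
#include <time.h>

double mass = 1.0;   /* placeholder value; the question assumes mass exists */

static double sqr(double x) { return x * x; }

double get_kinetic_energy(double v[], int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += sqr(v[i*3 + 0]) + sqr(v[i*3 + 1]) + sqr(v[i*3 + 2]);
    return 0.5 * mass * sum;
}

int main(void)
{
    enum { N = 1000000 };
    static double v[3 * N];           /* static: too large for the stack */
    for (int i = 0; i < 3 * N; i++)
        v[i] = (double)i / (3.0 * N); /* arbitrary test data */

    clock_t start = clock();
    double e = get_kinetic_energy(v, N);
    clock_t end = clock();

    /* print the result so the compiler cannot optimize the call away */
    printf("E = %f, elapsed = %f s\n",
           e, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}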
CodePudding user response:
One thing that looks a bit like premature optimization, but could just be unfamiliarity with what the language offers, is that you have all of the information describing the particles flattened into an array of double values.
I would suggest instead that you break this down, making your code easier to read by creating a struct to hold the three data points for each particle. At that point you can create functions which take a single particle or multiple particles and do computations on them.
This will be much easier for you than having to pass functions three arguments per particle, or trying to "slice" the array. If the code is easier for you to reason about, you're less likely to run into warnings and errors.
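A minimal sketch of what that might look like (the struct and function names here are illustrative, not prescribed):
typedef struct {
    double vx, vy, vz;   /* velocity components of one particle */
} Particle;

/* sums the squared speeds over all particles;
   the caller applies the 0.5 * mass factor */
double speed_squared_sum(const Particle *p, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += p[i].vx * p[i].vx + p[i].vy * p[i].vy + p[i].vz * p[i].vz;
    return sum;
}
A struct of three doubles is typically laid out exactly like three consecutive array elements, so the readability gain need not cost anything at runtime.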