What's the fastest way to generate a 2D grid of values in python?-CodePudding

I need to generate a 2D array in python whose entries are given by a different function above and under the diagonal.

I tried the following:

x = np.reshape(np.logspace(0.001,10,2**12),(1,4096))

def F(a,x):
  y = x.T
  Fu = np.triu(1/(2*y**2) * (y/x)**a * ((2*a 1)   (a-1)) / (a 1))
  Fl = np.tril(1/(2*y**3) * (x/y)**a * a/(2*a 1), -1)
  
  return Fu   Fl

and this works, but it's a bit too inefficient since it's computing a lot of values that are discarded anyway, some of which are especially slow due to the (x/y)**a term which leads to an overflow for high a (80 ). This takes me 1-2s to run, depending on the value of a, but I need to use this function thousands of times, so any improvement would be welcome. Is there a way to avoid computing the whole matrix twice before discarding the upper or lower triangular (which would also avoid the overflow problem), and make this function faster?

CodePudding user response：

You can move the multiplication before so to avoid multiplying a big temporary array (Numpy operations are done from the left to the right). You can also precompute (x/y)**a from (y/a)**a since it is just its inverse. Doing so is faster because computing the power of a floating-point number is slow (especially in double precision). Additionally, you can distribute the (x/y)**a operation so to compute x**a/y**a. This is faster because there is only O(2n) values to compute instead of O(n²). That being said, this operation is not numerically stable in your case because of the big power, so it is not safe. You can finally use numexpr so to compute the power in parallel using multiple threads. You can also compute the sum in-place so to avoid creating expensive temporary arrays (and more efficiently use your RAM). Here is the resulting code:

def F_opt(a,x):
    y = x.T
    tmp = numexpr.evaluate('(y/x)**a')
    Fu = np.triu(1/(2*y**2) * ((2*a 1)   (a-1)) / (a 1) * tmp)
    Fl = np.tril(1/(2*y**3) * a/(2*a 1) / tmp, -1)
    return np.add(Fu, Fl, out=Fu)

This is 5 times faster on my machine. Note there is still few warning about overflows like in the original code and an additional division by zero warning.

Note that you can make this code a bit faster using a parallel Numba code (especially if a is an integer known at compile-time). If you have access to an (expensive) server-side Nvidia GPU, then you can compute this more efficiently using the cupy package.