Optimizing the creation of a non-numeric matrix with NumPy

Time:05-26

While trying to optimize the creation and printing of a huge 2D matrix, I decided to try NumPy. Unfortunately, using the library made things worse rather than better. My goal is to create a matrix in which each cell holds a string containing its own indices, something like this (where n is the size of the matrix):

python_matrix = [[f"{y}, {x}" for x in range(n)] for y in range(n)]

And when I used the array() function of the NumPy library this way:

numpy_matrix = numpy.array([[f"{y}, {x}" for x in range(n)] for y in range(n)])

the creation time only increased. For example, for n = 1000, python_matrix is created in 0.032 s and numpy_matrix in 0.419 s, which is about 13 times slower than plain Python.

Also, numpy_matrix prints more slowly (when printing the full array, not the abbreviated form) than python_matrix does with a for loop:

import numpy

n = 1000

def numpy_matrix(n):
    matrix = numpy.array([[f"{y}, {x}" for x in range(n)] for y in range(n)])
    # threshold=numpy.inf asks for the full array instead of the "..." summary
    with numpy.printoptions(threshold=numpy.inf):
        print(matrix)

def python_matrix(n):
    matrix = [[f"{y}, {x}" for x in range(n)] for y in range(n)]
    # print one row (a list of strings) per line
    for row in matrix:
        print(row)

# time of numpy_matrix > time of python_matrix

  1. Is it better to stick with standard Python features, or is NumPy actually more efficient and I'm just not using it correctly?
  2. If I do use NumPy, how can I speed up printing the full matrix?

CodePudding user response:

Running an ipython session and using its timeit (with N = 1000), I don't get such large differences:

Making the list:

In [13]: timeit [[f"{y}, {x}" for y in range(N)] for x in range(N)]
492 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Making the array (while also making the list):

In [14]: timeit np.array([[f"{y}, {x}" for y in range(N)] for x in range(N)])
779 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Removing the list creation step from the time:

In [15]: %%timeit alist = [[f"{y}, {x}" for y in range(N)] for x in range(N)]
    ...: np.array(alist)
    ...: 
    ...: 
313 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So converting an existing list to an array costs about 0.3 s here, which is less than building the list in the first place.

Specifying the dtype helps a bit as well:

In [18]: %%timeit alist = [[f"{y}, {x}" for y in range(N)] for x in range(N)]
    ...: np.array(alist, dtype='U8')
    ...: 
    ...: 
224 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
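
Outside ipython, the same comparison can be reproduced with the standard timeit module. This is only a sketch: the helper name make_list and the timings dictionary are mine, and N is smaller than the 1000 used above so that it runs quickly. Absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np

N = 200      # smaller than the 1000 used above, just to keep the run short
REPEAT = 5   # number of calls averaged per measurement


def make_list(n):
    # list of lists of "y, x" strings, same comprehension as in the timings above
    return [[f"{y}, {x}" for y in range(n)] for x in range(n)]


alist = make_list(N)  # built once, so the last two timings exclude list creation

timings = {
    "list only": timeit.timeit(lambda: make_list(N), number=REPEAT),
    "list + array": timeit.timeit(lambda: np.array(make_list(N)), number=REPEAT),
    "array from existing list": timeit.timeit(lambda: np.array(alist), number=REPEAT),
    "array with dtype='U8'": timeit.timeit(lambda: np.array(alist, dtype="U8"), number=REPEAT),
}

for label, seconds in timings.items():
    print(f"{label}: {seconds / REPEAT * 1000:.2f} ms per call")
```

Note that 'U8' is only wide enough while the longest string has at most 8 characters, which holds for N up to 1000.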

Timing prints is more awkward, though we could just time the string formatting, str(x). I won't show the times, but yes, the array formatting is much slower. Essentially numpy has to go through Python's own string handling code; it has little of its own.
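
As for speeding up the full output, one option that sidesteps numpy's array formatter is to convert the array back to nested lists, build the whole text with str.join and write it once. A minimal sketch, assuming the one-list-per-line format of the plain Python loop is acceptable; rows_to_text is a hypothetical helper name, not a numpy function, and I haven't timed it here:

```python
import sys

import numpy as np


def rows_to_text(matrix):
    # Accept either a 2D array of strings or a list of lists.
    # tolist() converts the array to nested lists of plain Python str.
    rows = matrix.tolist() if isinstance(matrix, np.ndarray) else matrix
    # str(row) gives the same text that print(row) would show for a list
    return "\n".join(str(row) for row in rows)


n = 4  # tiny size for illustration
arr = np.array([[f"{y}, {x}" for x in range(n)] for y in range(n)])

# one write call for the whole matrix instead of one print per row
sys.stdout.write(rows_to_text(arr) + "\n")
```

Whether this beats the print loop in practice depends on where the output goes (terminal, file, pipe), so it is worth measuring on the real target.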

numeric list/array

For a numeric array, the relative difference is bigger:

In [29]: alist = [[(x,y) for y in range(N)] for x in range(N)]

In [30]: arr = np.array(alist)

In [31]: arr.shape
Out[31]: (1000, 1000, 2)

In [32]: timeit alist = [[(x,y) for y in range(N)] for x in range(N)]
171 ms ± 8.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [33]: timeit arr = np.array(alist)
832 ms ± 36.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

However, if we make the same array with array methods, i.e. not via a list of lists, the time is much better:

In [40]: timeit np.stack(np.broadcast_arrays(np.arange(N)[:,None], np.arange(N)),axis=2)
8.51 ms ± 89.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
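
To see what that one-liner does, here is the same construction split into steps for a small N. The intermediate names rows and cols are mine, added only for illustration:

```python
import numpy as np

N = 3
rows = np.arange(N)[:, None]   # shape (N, 1): the row index
cols = np.arange(N)            # shape (N,):   the column index

# broadcast both to shape (N, N), then stack them along a new last axis
arr = np.stack(np.broadcast_arrays(rows, cols), axis=2)

# the list-of-tuples version, for comparison
alist = [[(x, y) for y in range(N)] for x in range(N)]

print(arr.shape)                             # (3, 3, 2)
print(arr[1, 2])                             # [1 2]
print(np.array_equal(arr, np.array(alist)))  # True
```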

numpy is not an improvement over lists in all ways. It is best for math on existing arrays. Creating an array from lists is time consuming. And it doesn't add a whole lot to string handling.
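
For the string matrix in the question, the array-methods idea can also be tried with np.char.add, which concatenates string arrays element-wise. This is a sketch, assuming broadcasting between the column and row vectors behaves this way in your NumPy version; the names idx and left are mine. I haven't timed it, and given how little string handling numpy has of its own, it may not be a big win:

```python
import numpy as np

n = 4  # tiny size for illustration
idx = np.arange(n).astype(str)           # the indices as strings: '0', '1', ...

# (n, 1) column of row labels + ", " + (n,) row of column labels
left = np.char.add(idx[:, None], ", ")   # shape (n, 1)
matrix = np.char.add(left, idx)          # broadcasts to shape (n, n)

# the list-comprehension version from the question, for comparison
expected = np.array([[f"{y}, {x}" for x in range(n)] for y in range(n)])

print(matrix.shape)                      # (4, 4)
print(matrix[2, 3])                      # 2, 3
print(np.array_equal(matrix, expected))  # True
```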
