Saving data after loop with numpy

My code looks like this at the moment:

new_table = np.zeros(shape=(4,1),dtype=object) 

for i in y:
    # ... some calculation that produces `result`
    new_table = np.append(new_table, np.array([result]), axis=0)

After printing new_table, the result looks like this:

array([[0],
       [0],
       [0],
       [0],
       [(1, 61.087293, 33.429379, 0.42581059018640416)],
       [(1, 61.087293, 33.429379, 0.3203261022508016)],
       [(1, 61.087293, 33.429379, 0.45689267865065536)]], dtype=object)

But the output should not have those 4 zeros at the beginning of the array.

I am not sure what I am doing wrong. Also, is it possible to add column names to new_table, and if so, how?

Thanks.

CodePudding user response：

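The problem is that you create a (4, 1) array of zeros and then append more rows to it, so the four original zero rows stay at the top. Either start with an empty table and append to that, or allocate the table once and change its values in place. Note that with axis=0 the empty table must have zero rows but the same number of dimensions as the rows you append, e.g. np.empty((0, 1), dtype=object); a plain np.array([]) is one-dimensional and np.append will raise a ValueError.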
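A minimal sketch of the difference, assuming each result is a tuple stored in a single object cell as in the printed output of the question (the `row` array below is made-up stand-in data):

```python
import numpy as np

# Hypothetical stand-in for one `result` from the loop in the question:
# a (1, 1) object array whose single cell holds a tuple.
row = np.empty((1, 1), dtype=object)
row[0, 0] = (1, 61.087293, 33.429379, 0.42581059018640416)

# Starting from four zero rows keeps those rows in the output.
with_zeros = np.append(np.zeros((4, 1), dtype=object), row, axis=0)
print(with_zeros.shape)  # (5, 1)

# Starting from zero rows (same number of columns) leaves only the data.
without_zeros = np.append(np.empty((0, 1), dtype=object), row, axis=0)
print(without_zeros.shape)  # (1, 1)
```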

CodePudding user response：

Start with an empty array of the required shape. If your data is rows:

new_table = np.empty((0, 4)) 
for i in y:   
    ...
    new_table = np.append(new_table, np.array([result]), axis=0)

Keep in mind that np.append copies the entire array on every call, which is very inefficient. You're much better off skipping the initial array, accumulating the results in a list, and stacking them at the end:

table_list = []
for i in y:
    ...
    table_list.append(result)
new_table = np.stack(table_list, axis=0)
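The question also asks about column names. One possible approach, sketched here under assumptions, is to turn the accumulated list into a NumPy structured array. The field names `id`, `x`, `y` and `score` are hypothetical, since the question does not say what the four values mean, and the rows are copied from the output shown in the question:

```python
import numpy as np

# Rows as they would be accumulated in the loop (values from the question).
table_list = [
    (1, 61.087293, 33.429379, 0.42581059018640416),
    (1, 61.087293, 33.429379, 0.3203261022508016),
    (1, 61.087293, 33.429379, 0.45689267865065536),
]

# Hypothetical column names and types; adjust to the real meaning of the data.
dtype = [("id", "i8"), ("x", "f8"), ("y", "f8"), ("score", "f8")]

# Each tuple becomes one record; the result is 1-D with named fields.
new_table = np.array(table_list, dtype=dtype)

print(new_table.dtype.names)  # ('id', 'x', 'y', 'score')
print(new_table["score"])     # the fourth column, selected by name
```

If the table is mainly meant for labelled, mixed-type data, a pandas DataFrame built from the same list may be a more convenient choice.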

CodePudding user response：

If you are working with large data sets, it might make more sense to preallocate the array and then set the values, as opposed to appending to a growing array or list. I compared @Mad Physicist's solution to this approach.

import timeit
import numpy as np

y = np.random.randint(0, 100, 10000)    # dummy data

starttime1 = timeit.default_timer()
new_table = np.zeros((len(y), 4))

for idx, i in enumerate(y):
    # ... some dummy operation
    new_table[idx] = (i, i**2, i**3, i**4)

print(f"Preallocating : {timeit.default_timer() - starttime1} s")

table_list = []
starttime2 = timeit.default_timer()

for i in y:
    table_list.append((i, i**2, i**3, i**4))
new_table = np.stack(table_list, axis=0)

print(f"np.stack : {timeit.default_timer() - starttime2} s")

It seems that the first way outperforms the second one. I didn't benchmark this properly, since each variant was timed only once, but I assume that the time saved is even more significant for larger arrays.

Preallocating : 0.01815319999999998 s
np.stack : 0.026264800000000033 s
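Since a single timed run can be noisy, here is a sketch of a slightly more careful comparison that repeats each variant with timeit.repeat and keeps the best time. The function names `preallocate` and `stack_list` are made up for this example, and the timings will vary by machine, so none are shown:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 100, 10_000)  # dummy data, same size as above

def preallocate(values):
    # Allocate the full table once, then fill it row by row.
    table = np.zeros((len(values), 4))
    for idx, v in enumerate(values):
        table[idx] = (v, v**2, v**3, v**4)
    return table

def stack_list(values):
    # Collect rows in a list and build the array at the end.
    rows = [(v, v**2, v**3, v**4) for v in values]
    return np.stack(rows, axis=0)

calls = 3  # calls per timing run
for fn in (preallocate, stack_list):
    best = min(timeit.repeat(lambda: fn(y), number=calls, repeat=3))
    print(f"{fn.__name__}: {best / calls:.4f} s per call")
```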