My code looks like this at the moment:
new_table = np.zeros(shape=(4,1),dtype=object)
for i in y:
    # ... some calculation that produces result
    new_table = np.append(new_table, np.array([result]), axis=0)
After printing new_table, the result looks like this:
array([[0],
       [0],
       [0],
       [0],
       [(1, 61.087293, 33.429379, 0.42581059018640416)],
       [(1, 61.087293, 33.429379, 0.3203261022508016)],
       [(1, 61.087293, 33.429379, 0.45689267865065536)]], dtype=object)
But the output should not contain those 4 zeros at the beginning of the array.
I am not sure what I am doing wrong. Also, is it possible to add column names to new_table, and if so, how?
Thanks.
CodePudding user response:
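The problem is that you create a (4,1) array of zeros and then append more rows to it, so the four zero rows stay at the top. Either start with an empty array that has zero rows (for example np.empty((0, 4)); a plain np.array([]) is 1-D, so appending 2-D rows to it with axis=0 raises an error) and append to that, or preallocate the table and change its values in place.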
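A minimal sketch of the first option, assuming each result is a 4-tuple like the rows shown in the question (the sample values are copied from there). It also checks what happens when the starting array is 1-D:

```python
import numpy as np

# Sample rows taken from the output in the question.
rows = [
    (1, 61.087293, 33.429379, 0.42581059018640416),
    (1, 61.087293, 33.429379, 0.3203261022508016),
]

# Zero rows, four columns: there is nothing to strip off afterwards.
table = np.empty((0, 4))
for result in rows:
    table = np.append(table, np.array([result]), axis=0)

print(table.shape)

# A plain 1-D empty array has the wrong number of dimensions for axis=0.
try:
    np.append(np.array([]), np.array([rows[0]]), axis=0)
    append_failed = False
except ValueError:
    append_failed = True

print(append_failed)
```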
CodePudding user response:
Start with an empty array of the required shape. If your data is rows:
new_table = np.empty((0, 4))
for i in y:
    ...
    new_table = np.append(new_table, np.array([result]), axis=0)
Keep in mind that this reallocates the entire array on every iteration, which is very inefficient. You're much better off skipping the initial array, accumulating the rows in a list, and stacking them afterwards:
table_list = []
for ...:
    table_list.append(result)
new_table = np.stack(table_list, axis=0)
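As a runnable sketch of the list approach, the snippet below also touches on the column-name part of the question. It assumes each result is a 4-tuple like the rows in the question; the field names "id", "x", "y" and "score" are hypothetical placeholders, since the question does not say what the columns mean. A NumPy structured array is one possible way to attach names, not the only one:

```python
import numpy as np

# Hypothetical column names and types; adjust to the real data.
named_dtype = [("id", "i8"), ("x", "f8"), ("y", "f8"), ("score", "f8")]

table_list = []
for score in (0.42581059018640416, 0.3203261022508016, 0.45689267865065536):
    # Stand-in for "some calculation that produces result".
    result = (1, 61.087293, 33.429379, score)
    table_list.append(result)

# Plain 2-D float array, one row per result.
plain = np.stack(table_list, axis=0)

# 1-D structured array whose fields can be accessed by name.
named = np.array(table_list, dtype=named_dtype)

print(plain.shape)
print(named.dtype.names)
print(named["score"])
```

With the structured array, named["score"] returns that column as an ordinary 1-D array, while plain keeps the usual row and column indexing.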
CodePudding user response:
If you are working with large data sets, it might make more sense to preallocate the array and then set the values, rather than appending to a growing array or list. I compared @Mad Physicist's solution to a preallocation approach.
import timeit
import numpy as np
y = np.random.randint(0, 100, 10000) # dummy data
starttime1 = timeit.default_timer()
new_table = np.zeros((len(y), 4))
for idx, i in enumerate(y):
    # ... some dummy operation
    new_table[idx] = (i, i**2, i**3, i**4)
print(f"Preallocating : {timeit.default_timer() - starttime1} s")
table_list = []
starttime2 = timeit.default_timer()
for i in y:
    table_list.append((i, i**2, i**3, i**4))
new_table = np.stack(table_list, axis=0)
print(f"np.stack : {timeit.default_timer() - starttime2} s")
It seems that the first approach outperforms the second. I didn't benchmark this properly, but I assume that the time saved is even more significant for larger arrays.
Preallocating : 0.01815319999999998 s
np.stack : 0.026264800000000033 s
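Since a single timed run can be noisy, here is a sketch of a slightly more careful comparison using timeit.repeat and taking the best of several runs. The function names preallocate and stack_list are made up for this example, and the dummy operation is the same one used in the timing above. The timings depend on the machine, so no numbers are given:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 100, 10_000)  # dummy data, same size as above

def preallocate(values):
    # Allocate the full table once, then fill it row by row.
    out = np.zeros((len(values), 4))
    for idx, v in enumerate(values):
        out[idx] = (v, v**2, v**3, v**4)
    return out

def stack_list(values):
    # Collect rows in a list, then build the array in one call.
    rows = []
    for v in values:
        rows.append((v, v**2, v**3, v**4))
    return np.stack(rows, axis=0)

# Best of 3 repeats, each repeat running the function 3 times.
best_pre = min(timeit.repeat(lambda: preallocate(y), number=3, repeat=3))
best_stack = min(timeit.repeat(lambda: stack_list(y), number=3, repeat=3))

print(f"Preallocating : {best_pre / 3} s per call")
print(f"np.stack      : {best_stack / 3} s per call")
```

Both functions produce the same values, so the comparison only measures how the table is built.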