Efficiently convert dataframe rows to numpy arrays


I am using pyodbc to run queries against a large SQL Server database. Because SQL Server cannot store native numpy arrays, the data is kept as plain strings representing the array values. When I load this data into a DataFrame, it looks like this:

Id       Array1                             Array2
1        -21.722315 11.017685 -23.340452    2754.642 481.94247 21.728323
...
149001   1.611342 1.526262 -35.415166       6252.124 61.51516 852.15167

However, I then want to perform operations on Array1 and Array2, so I need to convert them to actual numpy arrays. My current approach is to apply np.fromstring to each column:

import numpy as np

df['Array1'] = df['Array1'].apply(lambda x: np.fromstring(x, dtype=np.float32, sep=' '))
df['Array2'] = df['Array2'].apply(lambda x: np.fromstring(x, dtype=np.float32, sep=' '))
# Elapsed Time: 9.524s

Result:

Id       Array1                               Array2
1        [-21.722315, 11.017685, -23.340452]  [2754.642, 481.94247, 21.728323]
...
149001   [1.611342, 1.526262, -35.415166]     [6252.124, 61.51516, 852.15167]

While this code works, I don't believe it is efficient or scalable. Is there a more computationally efficient way to transform a large amount of data into numpy arrays?

CodePudding user response:

Using tobytes() to get the underlying bytestring and frombuffer() to convert back will be much more efficient. One detail is that the array dimensions are lost after tobytes(), so you may need to reshape the result after frombuffer().
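For illustration, a minimal sketch of that round trip; the column name, the np.float32 dtype, and storing the bytes in a VARBINARY column are assumptions carried over from the question, not details confirmed by the answer:

arr = np.array([-21.722315, 11.017685, -23.340452], dtype=np.float32)

# Serialize to a raw bytestring before writing to the database
# (store it in a binary column such as VARBINARY rather than text).
raw = arr.tobytes()

# Reconstruct the array when reading back; frombuffer() wraps the bytes
# directly instead of parsing text, so it is far cheaper than fromstring().
restored = np.frombuffer(raw, dtype=np.float32)

# tobytes() flattens the array, so multi-dimensional data must be
# reshaped explicitly after frombuffer(), e.g.:
# restored = restored.reshape(original_shape)

assert np.array_equal(arr, restored)

# Applied to a DataFrame whose column already holds bytes objects:
# df['Array1'] = df['Array1'].apply(lambda b: np.frombuffer(b, dtype=np.float32))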
