I am using pyodbc to run queries against a large SQL Server database. Because SQL Server cannot hold native numpy arrays, the data is stored as plain strings representing numpy arrays. When I load this data into a DataFrame, it looks like this:
Id      Array1                            Array2
1       -21.722315 11.017685 -23.340452   2754.642 481.94247 21.728323
...
149001  1.611342 1.526262 -35.415166      6252.124 61.51516 852.15167
However, I then want to perform operations on Array1 and Array2, so I need to convert them to actual numpy arrays. My current approach is to apply np.fromstring to each column:
df['Array1'] = df['Array1'].apply(lambda x: np.fromstring(x, dtype=np.float32, sep = ' '))
df['Array2'] = df['Array2'].apply(lambda x: np.fromstring(x, dtype=np.float32, sep = ' '))
# Elapsed Time: 9.524s
Result:
Id      Array1                                Array2
1       [-21.722315, 11.017685, -23.340452]   [2754.642, 481.94247, 21.728323]
...
149001  [1.611342, 1.526262, -35.415166]      [6252.124, 61.51516, 852.15167]
While this code works, I don't believe it is efficient or scalable. Are there more computationally efficient ways of transforming a large amount of data into numpy arrays?
CodePudding user response:
Using tobytes() to get the underlying bytestring and frombuffer() to convert back will be much more efficient. One detail is that the array dimensions are lost by tobytes(), so you may need to reshape the array after frombuffer().
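A minimal sketch of that round trip, assuming float32 data and a known array length (the sample values and the single-column DataFrame are hypothetical, taken from the question):

import numpy as np
import pandas as pd

# Store each array as raw bytes, then rebuild it with frombuffer().
arr = np.array([-21.722315, 11.017685, -23.340452], dtype=np.float32)

raw = arr.tobytes()                              # shape information is lost here
restored = np.frombuffer(raw, dtype=np.float32)  # 1-D view over the same bytes
restored = restored.reshape(3)                   # reshape if a specific shape is needed

# Applied to a whole DataFrame column of byte strings:
df = pd.DataFrame({'Array1': [arr.tobytes(), arr.tobytes()]})
df['Array1'] = df['Array1'].apply(lambda b: np.frombuffer(b, dtype=np.float32))

Note that frombuffer() returns a read-only view over the bytes object; call .copy() on the result if you need to modify the arrays afterwards.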