I have some multidimensional data and was wondering whether I should use xarray, given that speed is one of my concerns, albeit not the highest.
I have a 4D array, so it's not so big as to preclude me from using numpy. The coordinates/indices are vital for one dimension but not for the others. I'll have to do some slight bookkeeping, but as the primary developer, that's fine with me. For the developers who iterate on the code after me, though, integer indexing might be more confusing than a label-based (xarray/pandas) approach. I could still use numpy if I document the process well, but I would like to use xarray for readability.
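To make the readability point concrete, here is a toy version (made-up names, not my real data) of the same assignment both ways:

import numpy as np
import xarray as xr

# A tiny stand-in array just to illustrate the two indexing styles
da = xr.DataArray(
    np.zeros((2, 2)),
    dims=['assembly', 'isotope'],
    coords={'assembly': ['A1', 'B12'], 'isotope': ['U235', 'U238']},
)

da.values[1, 0] = 3.1                               # integer-based: what does (1, 0) mean?
da.loc[dict(assembly='B12', isotope='U235')] = 3.1  # label-based: self-documenting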
After implementing a solution, I noticed that the operations/indexing below complete in about 5 seconds on my machine:
for isotope in isotopes:
    for height in heights:
        for assm in assemblies:
            da.loc[dict(power=['NW','NE','SW','SE'],
                        assembly=assm,
                        height=height,
                        isotope=isotope)] = [3,5,1,20]
If I do the same thing with an integer-based approach in xarray, it takes about 2 seconds:
for k,isotope in enumerate(isotopes):
    for j,height in enumerate(heights):
        for i,assm in enumerate(assemblies):
            da[i,[-4,-3,-2,-1],j,k] = [3,5,1,20]
Lastly, I noticed that the same integer-based indexing in numpy takes less than half a second:
arr = np.zeros((44,10,22,13))
for k,isotope in enumerate(isotopes):
    for j,height in enumerate(heights):
        for i,assm in enumerate(assemblies):
            arr[i,[-4,-3,-2,-1],j,k] = [3,5,1,20]
Speed is not my biggest concern here, but the label-based approach in xarray is more than 8 times slower than standard numpy integer-based indexing, and even xarray's integer-based approach is 4 times slower. That's dissuading me from digging deeper into xarray for medium-rank multidimensional data.
Any thoughts, advice, etc?
CodePudding user response:
We can't really tell you which package to use, and certainly not without knowing much more about your data and use case.
For what it's worth, while xarray's performance will always lag numpy's, the difference is most acute for small operations like this. You are assigning a tiny amount of data, using indexing, within a triple for loop, which is kryptonite for xarray. If you make all the assignments at once, the penalty decreases meaningfully, as the indexing overhead becomes less significant relative to the underlying numpy operations. Performance in xarray is all about minimizing that overhead and leveraging the backend's performance as much as possible while still getting the convenience of label-based indexing.
See this simple example. I've created a 3-D DataArray with 1 million float64s, indexed by (x, y, z):
In [11]: da = xr.DataArray(
...: np.random.random(size=(100, 100, 100)),
...: dims=list('xyz'),
...: coords=[pd.Index([f'{d}{i}' for i in range(100)], name=d) for d in 'xyz'],
...: )
Looping through x and y and then assigning along the first four elements of z incurs a huge penalty, with xarray coming in at just over 100x numpy's runtime for the same operation:
In [12]: %%time
...: for xi, x in enumerate(da.x.values):
...: for yi, y in enumerate(da.y.values):
...: da.loc[{'x': x, 'y': y, 'z': ['z0', 'z1', 'z2', 'z3']}] = [1, 2, 3, 4]
...:
CPU times: user 2.96 s, sys: 38.3 ms, total: 3 s
Wall time: 2.97 s
In [13]: %%time
...: for xi, x in enumerate(da.x.values):
...: for yi, y in enumerate(da.y.values):
...: da.values[xi, yi, :4] = [1, 2, 3, 4]
...:
CPU times: user 25.7 ms, sys: 508 µs, total: 26.3 ms
Wall time: 25.8 ms
If the same operation is restructured to make all the assignments at once, xarray's performance penalty decreases to about 6x the numpy runtime:
In [15]: %%time
...: da.loc[{'z': ['z0', 'z1', 'z2', 'z3']}] = np.tile([1, 2, 3, 4], (100, 100, 1))
...:
...:
CPU times: user 1.4 ms, sys: 675 µs, total: 2.07 ms
Wall time: 2.99 ms
In [16]: %%time
...: da.values[:, :, :4] = np.tile([1, 2, 3, 4], (100, 100, 1))
...:
...:
CPU times: user 488 µs, sys: 222 µs, total: 710 µs
Wall time: 428 µs
Assigning over the whole array brings xarray's overhead down further, to roughly 3x:
In [19]: %%time
...: da.loc[{'z': da.z}] = np.tile(np.random.random(100), (100, 100, 1))
...:
...:
CPU times: user 11.2 ms, sys: 9.43 ms, total: 20.7 ms
Wall time: 20.9 ms
In [20]: %%time
...: da.values[:, :, :] = np.tile(np.random.random(100), (100, 100, 1))
...:
...:
CPU times: user 3.08 ms, sys: 4.61 ms, total: 7.7 ms
Wall time: 6.72 ms
It's up to you whether this is worth the cost. But whichever you choose, don't use nested for loops for assignment :)
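For reference, here's what that could look like for the case in your question. This is only a sketch: it assumes the dimension order implied by your integer indexing, (assembly, power, height, isotope), and uses made-up power labels with 'NW', 'NE', 'SW', 'SE' as the last four:

import numpy as np
import xarray as xr

# Hypothetical coordinates standing in for the real ones; the shape
# matches the question's np.zeros((44, 10, 22, 13)).
da = xr.DataArray(
    np.zeros((44, 10, 22, 13)),
    dims=['assembly', 'power', 'height', 'isotope'],
    coords={'power': [f'P{i}' for i in range(6)] + ['NW', 'NE', 'SW', 'SE']},
)

# One label-based assignment replaces the entire triple loop: the (4,)
# values are reshaped to (4, 1, 1) so they broadcast across the
# trailing height and isotope dimensions.
da.loc[dict(power=['NW', 'NE', 'SW', 'SE'])] = np.array([3, 5, 1, 20])[:, None, None]

This pays the label-resolution overhead once instead of once per loop iteration, which is exactly where the timings above show xarray closing the gap with numpy.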