Data structure for sparse n-dimensional array / tensor such that A[0, :, :] and A[1, :, :] can have different shapes

With Python/NumPy, I'm working with n-dimensional data (ideally in an ndarray) such that:

(1) For example, A[0, :, :], ..., A[49, :, :] can be of shape 100x100, while A[50, :, :] could be of shape 10000x10000. I don't want to create an ndarray of shape (..., 10000, 10000), because that would waste space: A[n, :, :] contains only 100x100 coefficients for n = 0 .. 49.
(2) I would also like to be able to work with B[:, :, :, :, 202206231808] and B[:, :, :, :, 19700101000000], i.e. the last dimension would be a numerical timestamp in the format YYYYMMDDhhmmss, or more generally an integer label (not in a contiguous 0 .. n-1 range).
(3) All of this should keep, as far as possible, the standard NumPy operations, such as B.mean(axis=4) to average the data over all timestamps.

What is the right data structure for such an n-dimensional array, with NumPy or Pandas? (NB: I will probably have 5 or 6 dimensions.)

CodePudding user response：

As far as I know, you can build a data structure with inconsistent dimensions, like the one you want, with tensorflow.ragged.constant(), as follows:

import tensorflow as tf

l1 = tf.ragged.constant([[0, 1, 0], [1, 1]])
print(l1)

The main advantage of using the TensorFlow library here is that you can convert your tensors to NumPy arrays with a single call, your_tensor.numpy().
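To make the ragged idea a little more concrete, here is a minimal sketch, assuming TensorFlow 2.x is installed. The variable names are only illustrative. It shows how to inspect the per-row lengths, get the data back as a nested Python list, and pad to a dense array when a rectangular shape is needed:

```python
import tensorflow as tf

# Two rows of different length (floats, so that padding below is unambiguous).
rt = tf.ragged.constant([[0.0, 1.0, 0.0], [1.0, 1.0]])

# Number of items stored in each row.
lengths = rt.row_lengths().numpy()

# The same data as a plain nested Python list.
nested = rt.to_list()

# Dense NumPy array; rows shorter than the longest one are padded with 0.0.
padded = rt.to_tensor(default_value=0.0).numpy()
```

Note that padding brings back the wasted space the question wants to avoid, so it only makes sense for the parts of the data that are close to rectangular.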

CodePudding user response：

It looks like you're looking for xarray. Indices in NumPy are purely positional: an axis can't be indexed by a timestamp, because its first index is always 0 and its last index is always the axis length minus 1.
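To illustrate what "purely positional" means in practice, here is a small sketch of the bookkeeping you would have to do by hand in plain NumPy. The labels array and the position_of helper are hypothetical names made up for this example, and the labels are assumed to be sorted:

```python
import numpy as np

# Hypothetical integer timestamp labels (YYYYMMDDhhmmss), sorted ascending.
labels = np.array([19700101000000, 20220623180800, 20220623180900], dtype=np.int64)

# Data whose last axis has one entry per label.
data = np.arange(2 * 3, dtype=float).reshape(2, 3)


def position_of(label):
    """Return the position of `label` along the last axis, or raise KeyError."""
    pos = np.searchsorted(labels, label)
    if pos == len(labels) or labels[pos] != label:
        raise KeyError(label)
    return int(pos)


# NumPy only understands the position, so the label is translated first.
column = data[:, position_of(20220623180800)]
```

This label-to-position mapping is the part that a labeled-array library keeps track of for you.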

xarray uses n-dimensional arrays as a computational engine, but adds the concept of labeled indexing from pandas. It's a NumFOCUS-supported project with a lot of users and growing tie-ins to pandas, NumPy, and dask (for distributed processing). You can easily create an n-dimensional array with, for example, datetime coordinates (dimension labels) and select using these labels. You can also use the sparse package's COO arrays as a backend if desired.

See the quickstart for an introduction.

For example, you can create an array from a NumPy ndarray, adding dimension names and coordinate labels:

import xarray as xr, numpy as np, pandas as pd

da = xr.DataArray(
    np.random.random(size=(10, 10, 100)),
    dims=['x', 'y', 'time'],
    coords=[
        range(10),
        range(-100, 0, 10),
        pd.date_range('2022-06-23 18:08', periods=100, freq='s'),
    ],
)

Here's what this looks like displayed:

In [3]: da
Out[3]:
<xarray.DataArray (x: 10, y: 10, time: 100)>
array([[[5.20920842e-01, 4.69121072e-01, 6.40222454e-01, ...,
         2.99971293e-01, 2.62265561e-01, 6.35366406e-01],
        ...,
        [2.67650196e-01, 1.83472873e-01, 9.28958673e-01, ...,
         2.54365478e-01, 5.31364961e-01, 7.64313509e-01]],
...

       [[4.36503680e-01, 6.04280469e-01, 3.74281880e-01, ...,
         9.41795201e-03, 2.45035315e-01, 4.36213072e-01],
        ...,
        [2.70554857e-01, 9.81791362e-01, 3.67033886e-01, ...,
         2.37171168e-01, 3.92829137e-01, 1.18888502e-02]]])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9
  * y        (y) int64 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10
  * time     (time) datetime64[ns] 2022-06-23T18:08:00 ... 2022-06-23T18:09:39

The underlying array is still numpy:

In [4]: type(da.data)
Out[4]: numpy.ndarray

You can select along dimensions positionally, or by label using .sel:

In [5]: da.sel(time='2022-06-23T18:09:01')
Out[5]:
<xarray.DataArray (x: 10, y: 10)>
array([[0.61802968, 0.44798696, 0.53146839, 0.54672015, 0.52251633,
        0.69215547, 0.84386726, 0.72421072, 0.87467204, 0.87845358],
       [0.22257334, 0.32035713, 0.08175992, 0.34816822, 0.84258207,
        0.80708575, 0.02339722, 0.1904887 , 0.77412369, 0.34198665],
       [0.4987155 , 0.05057836, 0.11611118, 0.95652761, 0.88992791,
        0.15960549, 0.31591357, 0.77504342, 0.04418024, 0.02722908],
       [0.76613849, 0.88007545, 0.27904722, 0.56225594, 0.39773015,
        0.23494531, 0.54437166, 0.41985857, 0.92803277, 0.63992328],
       [0.00981116, 0.2688392 , 0.17421749, 0.45761431, 0.74987955,
        0.8115907 , 0.42623655, 0.9660985 , 0.25014544, 0.47767839],
       [0.21176705, 0.17295334, 0.25520267, 0.17743549, 0.10468529,
        0.48232753, 0.55139512, 0.9658701 , 0.52430646, 0.99446656],
       [0.83707974, 0.07546811, 0.70503445, 0.62984982, 0.5956393 ,
        0.93147836, 0.97454177, 0.92595764, 0.4889221 , 0.59362206],
       [0.04210777, 0.56803518, 0.78362288, 0.54106628, 0.09178342,
        0.63581206, 0.03913531, 0.43868853, 0.22767441, 0.86995461],
       [0.88047   , 0.86284775, 0.26553173, 0.06123448, 0.55392798,
        0.44922685, 0.18933487, 0.16720496, 0.40440954, 0.79741338],
       [0.22714674, 0.76756767, 0.08131078, 0.64319224, 0.39983711,
        0.792     , 0.32000998, 0.42772083, 0.19313205, 0.35174807]])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9
  * y        (y) int64 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10
    time     datetime64[ns] 2022-06-23T18:09:01
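The same label-based selection works with plain integer labels, which is closer to what the question asks for. A minimal sketch, where the dimension name stamp and the sample values are made up for illustration:

```python
import numpy as np
import xarray as xr

# Hypothetical integer labels in YYYYMMDDhhmmss form (not a 0 .. n-1 range).
stamps = [19700101000000, 20220623180800, 20220623180900]

b = xr.DataArray(
    np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3),
    dims=["x", "y", "stamp"],
    coords={"stamp": stamps},
)

# Select by label rather than by position; the result has dims (x, y).
one_slice = b.sel(stamp=20220623180900)

# Reduce by dimension name instead of axis number, like B.mean(axis=...) in NumPy.
mean_over_stamps = b.mean(dim="stamp")
```

Reducing with dim="stamp" means you don't have to remember which axis number holds the timestamps.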

Alignment in xarray is done by dimension name rather than axis order, so there's no need to pad an array with singleton axes, e.g. shape (1, 1, 1, 1, 1000), just to make it broadcast. Instead, just ensure that dimension names are consistent across your arrays, and two arrays with shared dimension names will be broadcast against each other correctly. See the docs on computation: automatic alignment for more info.
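A short sketch of broadcasting by name. The array names and sizes are invented for the example, and none of the arrays carry coordinate labels, so only the dimension names matter:

```python
import numpy as np
import xarray as xr

# A 3-D array and two 1-D arrays that each share one dimension name with it.
field = xr.DataArray(np.ones((2, 3, 4)), dims=["x", "y", "time"])
weights = xr.DataArray(np.array([1.0, 2.0, 3.0, 4.0]), dims=["time"])
per_y = xr.DataArray(np.array([10.0, 20.0, 30.0]), dims=["y"])

# No reshaping to (1, 1, 4) or (1, 3, 1) is needed: the names line things up.
weighted = field * weights
shifted = field + per_y
```

In plain NumPy the second operation would need per_y reshaped to (3, 1) before it could be added to an array of shape (2, 3, 4).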