Data structure for n-dimensional array / tensor such that A[0, :, :] and A[1, :, :] can have different shapes

With Python/Numpy, I'm working with n-dimensional data (ideally in a ndarray) such that:

(1): ragged array

For example, A[0, :, :], ..., A[49, :, :] can be of shape 100x100, and A[50, :, :] could be of shape 10000x10000. I don't want to create an ndarray of shape (..., 10000, 10000) because it would waste space: A[n, :, :] only contains 100x100 coefficients for n = 0 .. 49.

(2): labeled indexing instead of positional indexing

I would also like to be able to work with B[:, :, :, :, 202206231808], B[:, :, :, :, 19700101000000], i.e. the last dimension would be indexed by a numerical timestamp in the format YYYYMMDDhhmmss, or more generally by an integer label (not in a contiguous 0 .. n-1 range).

(3): easy Numpy-like arithmetic

All of this should keep (as much as possible) the standard Numpy operations, such as B.mean(axis=4) to average the data over all timestamps.

(4): serialization / random access

We should be able to save 100 GB of data held in such a data structure to disk. Then, the next day, if we only want to modify a few values, we should be able to save the changes without rewriting the full 100 GB file:

x = datastore.open('datastore.dat')                              # open the data store, *without* loading everything in memory
x[20220624000000, :, :, :] = 0                                   # modify some values
x[20220510120000, :, :, :] -= x[20220510120000, :, :, :].mean()  # modify other values
x.close()                                                        # only a few bytes written to disk

What is the right data structure for such an n-dimensional array, with Numpy or Pandas? (NB: I will probably have 5 or 6 dimensions.)
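
To put a number on the wasted space in point (1), here is a rough back-of-the-envelope calculation. It assumes float64 data (8 bytes per value), which the question does not state, and uses only the shapes given in the example.

```python
import numpy as np

itemsize = np.dtype(np.float64).itemsize  # 8 bytes, assuming float64 data

# Shapes from the example: 50 slices of 100x100, one slice of 10000x10000.
shapes = [(100, 100)] * 50 + [(10000, 10000)]

# Padded: every slice is stored at the largest shape.
padded_bytes = len(shapes) * 10000 * 10000 * itemsize
# Ragged: every slice is stored at its own shape.
ragged_bytes = sum(rows * cols for rows, cols in shapes) * itemsize

print(padded_bytes / 1e9)  # 40.8 (GB)
print(ragged_bytes / 1e9)  # 0.804 (GB)
```

Under that assumption, padding costs roughly 50 times the space the values actually need.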

CodePudding user response：

As far as I know, you can build a ragged data structure like the one you want (rows of different lengths) with tensorflow.ragged.constant(), as follows:

import numpy as np
import tensorflow as tf

l1 = tf.ragged.constant([[0, 1, 0], [1, 1]])
print(l1)

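The main advantage of using the TensorFlow library here is that you can convert your tensors to NumPy arrays with a single call, your_tensor.numpy(). Note that for a ragged tensor this returns an object array of per-row arrays rather than a rectangular ndarray.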

CodePudding user response：

It looks like you're looking for xarray. Indices in numpy are purely positional. You can't have an index in numpy be a timestamp, because the first index is always 0, and the last index is always len(axis) - 1.
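
To make the positional-only point concrete, here is a minimal sketch of what a label lookup has to do on top of plain numpy. The labels, the array contents and the helper position_of are made up for illustration; they are not part of numpy.

```python
import numpy as np

# Hypothetical integer timestamp labels (YYYYMMDDhhmm), kept sorted.
labels = np.array([197001010000, 202206231808, 202206240936])

# Plain numpy array: the last axis has one position per label.
data = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)

def position_of(label):
    """Translate an integer label into a position on the last axis."""
    pos = int(np.searchsorted(labels, label))
    if pos == len(labels) or labels[pos] != label:
        raise KeyError(label)
    return pos

# numpy itself only understands the position (here 1), not the label.
by_label = data[..., position_of(202206231808)]
```

Keeping such a label-to-position mapping in sync by hand for every axis is the bookkeeping that a labeled-array library takes over.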

xarray uses n-dimensional arrays as a computational engine, but adds the concept of labeled indexing from pandas. It's a NumFOCUS-supported project with a lot of users and growing tie-ins to pandas, numpy, and dask (for distributed processing). You can easily create an ND-Array with e.g. datetime coordinates (dimension labels) and select using these labels. You can also use the sparse package's COO arrays as a backend if desired.

See the quickstart for an introduction.

For example, you can create an array from a numpy ndarray, but add dimension names and coordinate labels:

import xarray as xr, numpy as np, pandas as pd

da = xr.DataArray(
    np.random.random(size=(10, 10, 100)),
    dims=['x', 'y', 'time'],
    coords=[
        range(10),
        range(-100, 0, 10),
        pd.date_range('2022-06-23 18:08', periods=100, freq='s'),
    ],
)

Here's what this looks like displayed:

In [3]: da
Out[3]:
<xarray.DataArray (x: 10, y: 10, time: 100)>
array([[[5.20920842e-01, 4.69121072e-01, 6.40222454e-01, ...,
         2.99971293e-01, 2.62265561e-01, 6.35366406e-01],
        ...,
        [2.67650196e-01, 1.83472873e-01, 9.28958673e-01, ...,
         2.54365478e-01, 5.31364961e-01, 7.64313509e-01]],
...

       [[4.36503680e-01, 6.04280469e-01, 3.74281880e-01, ...,
         9.41795201e-03, 2.45035315e-01, 4.36213072e-01],
        ...,
        [2.70554857e-01, 9.81791362e-01, 3.67033886e-01, ...,
         2.37171168e-01, 3.92829137e-01, 1.18888502e-02]]])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9
  * y        (y) int64 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10
  * time     (time) datetime64[ns] 2022-06-23T18:08:00 ... 2022-06-23T18:09:39

The underlying array is still numpy:

In [4]: type(da.data)
Out[4]: numpy.ndarray

You can select along dimensions positionally, or by label using .sel:

In [5]: da.sel(time='2022-06-23T18:09:01')
Out[5]:
<xarray.DataArray (x: 10, y: 10)>
array([[0.61802968, 0.44798696, 0.53146839, 0.54672015, 0.52251633,
        0.69215547, 0.84386726, 0.72421072, 0.87467204, 0.87845358],
       [0.22257334, 0.32035713, 0.08175992, 0.34816822, 0.84258207,
        0.80708575, 0.02339722, 0.1904887 , 0.77412369, 0.34198665],
       [0.4987155 , 0.05057836, 0.11611118, 0.95652761, 0.88992791,
        0.15960549, 0.31591357, 0.77504342, 0.04418024, 0.02722908],
       [0.76613849, 0.88007545, 0.27904722, 0.56225594, 0.39773015,
        0.23494531, 0.54437166, 0.41985857, 0.92803277, 0.63992328],
       [0.00981116, 0.2688392 , 0.17421749, 0.45761431, 0.74987955,
        0.8115907 , 0.42623655, 0.9660985 , 0.25014544, 0.47767839],
       [0.21176705, 0.17295334, 0.25520267, 0.17743549, 0.10468529,
        0.48232753, 0.55139512, 0.9658701 , 0.52430646, 0.99446656],
       [0.83707974, 0.07546811, 0.70503445, 0.62984982, 0.5956393 ,
        0.93147836, 0.97454177, 0.92595764, 0.4889221 , 0.59362206],
       [0.04210777, 0.56803518, 0.78362288, 0.54106628, 0.09178342,
        0.63581206, 0.03913531, 0.43868853, 0.22767441, 0.86995461],
       [0.88047   , 0.86284775, 0.26553173, 0.06123448, 0.55392798,
        0.44922685, 0.18933487, 0.16720496, 0.40440954, 0.79741338],
       [0.22714674, 0.76756767, 0.08131078, 0.64319224, 0.39983711,
        0.792     , 0.32000998, 0.42772083, 0.19313205, 0.35174807]])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9
  * y        (y) int64 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10
    time     datetime64[ns] 2022-06-23T18:09:01

Alignment in xarray is done by dimension name rather than axis order, so there's no reason to have an array with shape (1, 1, 1, 1, 1000). Instead, just ensure that dimension names are consistent across your arrays, and two arrays with shared dimension names will be broadcast against each other correctly. See the docs on computation: automatic alignment for more info.
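
Closer to the question's integer timestamps, here is a small sketch of the same ideas with integer labels instead of datetimes. The label values, array sizes and variable names are made up for illustration; only DataArray, .sel and .mean(dim=...) come from xarray itself.

```python
import numpy as np
import xarray as xr

# Hypothetical integer timestamps (YYYYMMDDhhmmss) used directly as labels.
stamps = [19700101000000, 20220510120000, 20220623180800]

b = xr.DataArray(
    np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3),
    dims=["x", "y", "time"],
    coords={"time": stamps},
)

one_slice = b.sel(time=20220623180800)  # label-based selection on "time"
time_mean = b.mean(dim="time")          # like b.mean(axis=2), but by name

# Arithmetic aligns on dimension names, so a 1-D array along "time"
# broadcasts against b without reshaping it to (1, 1, 3).
offsets = xr.DataArray([1.0, 2.0, 3.0], dims=["time"], coords={"time": stamps})
shifted = b - offsets
```

Selecting and reducing by dimension name means the code does not need to know that time happens to be the last axis.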

CodePudding user response：

The DOK (dictionary of keys) array from the sparse package might be useful as well. Example:

import numpy as np
import sparse
s3 = sparse.DOK((1000, 1000, 10, 300000000000), fill_value=np.nan)  # allowing timestamps
                                                                    # up to year 3000!
s3[:500, :200, 0, 202206240936] = np.random.rand(500, 200)
print(s3[:, :, 0, 202206240936])
print(s3[:, :, 0, 202206240936].todense())  # the  500x200 matrix above, completed with NaN
print(s3[:, :, 0, 197001010000].todense())  # only NaN because not defined
print(s3.nbytes)
sparse.save_npz('test.npz', s3)  # Here it takes ~ 2MB for 100k float64
                                 #                  which would normally take 0.8 MB

Only drawback: I'm not sure how to store a large amount of data this way (for example 100 GB), then open the structure the next day, modify only a few values here and there, and save it back to disk without rewriting the whole 100 GB file (comments are welcome).
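
For comparison, here is one possible way to get in-place edits on disk, sketched with numpy's memory-mapped .npy files rather than sparse. It assumes the data is dense and positionally indexed, so it addresses the partial-write concern but not the labels or the ragged shapes; the file name and array sizes are made up for illustration.

```python
import os
import tempfile
import numpy as np

# Hypothetical file location, created in a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), "datastore.npy")

# Day 1: create the on-disk array (tiny here; the calls are the same for big files).
store = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float64, shape=(4, 100, 100)
)
store[:] = 1.0
store.flush()
del store

# Day 2: reopen without loading everything, then modify one small block in place.
store = np.lib.format.open_memmap(path, mode="r+")
store[2, :10, :10] = 0.0
store.flush()  # pushes the modified region back to the same file
del store

# Read back lazily to check that only the chosen block changed.
check = np.load(path, mmap_mode="r")
```

The file is never rewritten as a whole: the array is mapped from disk, and only the modified region has to be written back by the operating system.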