Best way to transform NxD timeseries dataset to (N-T+1)xTxD?

Unfortunately, I couldn't come up with a better title, and not being able to describe the problem well has probably hindered my search for an existing answer.

So, I have a timeseries dataset ds with N1 rows and D columns. A recurrent neural network needs the data in an N2xTxD format, where N2 = N1-T+1. For example, if the sequence length T is 2, the first element of the new dataset, ds2[0], is the original dataset's first 2 rows, ds[0:2, :]. The second element, ds2[1], is ds[1:3, :], and so on until ds2[N2-1] = ds[N1-2:N1, :].

The way I do it now is using these functions:

import numpy as np

#Shift Array arr's elements by num positions
def NpShift(arr, num, fill_value = np.nan):
    result = np.empty_like(arr)
    result[:num] = fill_value
    result[num:] = arr[:-num]
    return result


def TemporalTransformation(ds, T):
    tmp = ds
    ds = ds.reshape(-1, 1, ds.shape[1]) #By definition ds is NxD, so Nx1xD is -1x1xshape[1]
    
    for t in range(T):
        ds = np.concatenate((NpShift(tmp, t + 1)[:, np.newaxis, :], ds), axis = 1) #Adding the shifted matrices one by one
    ds = ds[T-1:, 1:, :] #The 1st T-1 elements contain the shifted values so they have to be discarded; same goes for the 1st element on axis=1
    
    return ds

You can test it to see that the results are correct using:

t = 2
xall = np.array([[1,1,1], [2,2,2], [3,3,3], [4,4,4], [5,5,5]], dtype = float)
print(f"ds shape:\n{xall.shape}")
print(f"ds:\n{xall}\n")
ds2 = TemporalTransformation(xall, t)
print("ds2 shape:\n", ds2.shape)
print(f"ds2:\n{ds2}")

which outputs:

ds shape:
(5, 3)
ds:
[[1. 1. 1.]
 [2. 2. 2.]
 [3. 3. 3.]
 [4. 4. 4.]
 [5. 5. 5.]]

ds2 shape:
 (4, 2, 3)
ds2:
[[[1. 1. 1.]
  [2. 2. 2.]]

 [[2. 2. 2.]
  [3. 3. 3.]]

 [[3. 3. 3.]
  [4. 4. 4.]]

 [[4. 4. 4.]
  [5. 5. 5.]]]

Now, that works perfectly and accomplishes what I want. However, for a large T (e.g. 700) on big datasets (hundreds of thousands of rows), the conversion takes a terrifying amount of time (30 minutes or so).

I can observe how this (currently) single-threaded piece of code slowly and steadily allocates more and more RAM as it creates the final (N-T+1)xTxD tensor (3-dimensional array).

Is there a way to do it quicker and possibly without allocating such a huge amount of memory? I mean, in its core, the values of ds2 are the same as ds1, so I would think a way to do it with pointers should exist (I just can't think of how).

Any possible solution should preferably work on both Windows and Linux. One last noteworthy thing: eventually, this N2xTxD NumPy array will be consumed in batches (one iteration takes the first b rows, the next takes the following b rows), and each batch will become a PyTorch tensor.

Now, I am familiar with torch.utils.data.Dataset, and I have tried extending it by inheriting from it to make my own iterator:

import numpy as np
import torch
from torch.utils.data import Dataset

class TemporalTransformation_Dataset(Dataset):
    def __init__(self, data, T):
        self.data = data
        self.T = T

    def __getitem__(self, index):
        # Window of T consecutive rows starting at index
        Xi = self.data[index : index + self.T]
        return Xi

    def __len__(self):
        # Number of complete windows of length T
        return self.data.shape[0] - self.T + 1

t = 2
ds = torch.from_numpy(np.array([[1,1,1], [2,2,2], [3,3,3], [4,4,4], [5,5,5]]))
print(f"ds shape:\n{ds.shape}")
print(f"ds:\n{ds}\n")
ds2 = TemporalTransformation_Dataset(ds, t)
ds2_loader = torch.utils.data.DataLoader(dataset = ds2, batch_size = len(ds2), shuffle = False)
print("W/o Y:\n", next(iter(ds2_loader)))

However, training gets significantly slower compared to my NumPy implementation, roughly double the time, so it's no fun. That said, a PyTorch solution that is comparably fast to my NumPy one is also something I could use; I just don't see how to make it faster. It seems like this is a PyTorch issue.

Answer:

"[...] in its core, the values of ds2 are the same as ds1, so I would think a way to do it with pointers should exist" Your intuition is correct. Here's one way you can do that, using NumPy's as_strided function. It creates a new view of the array, without copying the underlying data:

import numpy as np
from numpy.lib.stride_tricks import as_strided

def transformed_view(ds, T):
    ds = np.asarray(ds)
    if ds.ndim != 2:
        raise ValueError('ds must be a 2-d array.')
    shp = ds.shape
    if T < 1 or T > shp[0]:
        raise ValueError('Must have 1 <= T <= ds.shape[0]')

    strides = ds.strides
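    # Stepping along either of the first two axes advances one row of ds,
    # so consecutive windows overlap and no data is copied.
    return as_strided(ds, shape=(shp[0] - T + 1, T, shp[1]),
                      strides=(strides[0], strides[0], strides[1]))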

For example,

In [49]: xall = np.array([[1,1,1], [2,2,2], [3,3,3], [4,4,4], [5,5,5]], dtype=float)

In [50]: xall
Out[50]: 
array([[1., 1., 1.],
       [2., 2., 2.],
       [3., 3., 3.],
       [4., 4., 4.],
       [5., 5., 5.]])

In [51]: transformed_view(xall, 2)
Out[51]: 
array([[[1., 1., 1.],
        [2., 2., 2.]],

       [[2., 2., 2.],
        [3., 3., 3.]],

       [[3., 3., 3.],
        [4., 4., 4.]],

       [[4., 4., 4.],
        [5., 5., 5.]]])
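
A note on the approach above: the NumPy documentation advises using as_strided with care, since writing to an overlapping view changes several windows at once. As a possible alternative, here is a minimal sketch using sliding_window_view, which assumes NumPy 1.20 or newer and returns a read-only view. The names windowed_view and iter_batches are hypothetical helpers, not part of NumPy, and the batching loop is only one way the batched access described in the question might be handled.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def windowed_view(ds, T):
    """Return a read-only (N-T+1, T, D) view of a 2-d (N, D) array."""
    ds = np.asarray(ds)
    if ds.ndim != 2:
        raise ValueError('ds must be a 2-d array.')
    # sliding_window_view appends the window axis last, giving (N-T+1, D, T);
    # swapping the last two axes yields (N-T+1, T, D) without copying.
    return sliding_window_view(ds, window_shape=T, axis=0).transpose(0, 2, 1)


def iter_batches(view, batch_size):
    """Yield independent C-ordered copies of consecutive batches (hypothetical helper)."""
    for start in range(0, view.shape[0], batch_size):
        # Only this batch is materialized: at most batch_size * T * D elements.
        yield np.array(view[start:start + batch_size], order='C')


xall = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4], [5, 5, 5]], dtype=float)
view = windowed_view(xall, 2)
print(view.shape)                     # (4, 2, 3)
print(np.shares_memory(view, xall))   # True
batches = list(iter_batches(view, 3))
print([b.shape for b in batches])     # [(3, 2, 3), (1, 2, 3)]
```

With this layout the full (N-T+1)xTxD array is never allocated; only one batch at a time is copied, and each copied batch could then be handed to torch.from_numpy.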