More efficient way to build a dataset than using lists

I am building a dataset for a sequence-to-point conv network, where each window is moved by one timestep. Basically, this loop does it:

    x_train = []
    y_train = []


    for i in range(window,len(input_train)):
        x_train.append(input_train[i-window:i].tolist())
        y = target_train[i-window:i]
        y = y[int(len(y)/2)]
        y_train.append(y)

When I use a large value for window, e.g. 500, I get a memory error. Is there a way to build the training dataset more efficiently?

CodePudding user response：

You should use pandas. It still might take too much space, but you can try:

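    import pandas as pd

    # if input_train isn't a pd.Series already
    input_train = pd.Series(input_train)

    # each window gets a fresh 0..window-1 index so it becomes one row
    rolling_data = (w.reset_index(drop=True) for w in input_train.rolling(window))
    # drop the first window - 1 rows, which are incomplete
    x_train = pd.DataFrame(rolling_data).iloc[window - 1:]
    # one target per row: the value at the middle of each window
    y_train = target_train[window // 2 : window // 2 + len(x_train)]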

Some explanations with an example:

Assuming a simple series:

    >>> input_train = pd.Series([1, 2, 3, 4, 5])
    >>> input_train
    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64

We can create a dataframe with the windowed data like so:

    >>> pd.DataFrame(input_train.rolling(2))
         0    1    2    3    4
    0  1.0  NaN  NaN  NaN  NaN
    1  1.0  2.0  NaN  NaN  NaN
    2  NaN  2.0  3.0  NaN  NaN
    3  NaN  NaN  3.0  4.0  NaN
    4  NaN  NaN  NaN  4.0  5.0

The problem with this is that the values in each window keep their original index labels (the first value has label 0, the second has label 1, etc.), so each value ends up in the column matching its label. We can fix this by resetting the index of each window:

    >>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2))
         0    1
    0  1.0  NaN
    1  1.0  2.0
    2  2.0  3.0
    3  3.0  4.0
    4  4.0  5.0

The only thing left to do is remove the first window - 1 rows, because those windows are incomplete (iterating over a rolling object starts with partial windows):

    >>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2)).iloc[2-1:] # .iloc[1:]
         0    1
    1  1.0  2.0
    2  2.0  3.0
    3  3.0  4.0
    4  4.0  5.0
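
If the pandas version still runs out of memory, a different technique that may help is NumPy's sliding_window_view (available in NumPy 1.20 and later). This is not the approach from the answer above, just a minimal sketch. It assumes input_train and target_train are 1-D arrays of equal length. The helper name make_windows and the sample data are made up for illustration. The windows are a read-only view onto the original array, so no window data is copied at this step:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(inputs, targets, window):
    """Hypothetical helper: build (x, y) like the loop in the question."""
    inputs = np.asarray(inputs)
    targets = np.asarray(targets)
    # All windows of length `window` as a read-only view,
    # shape (len(inputs) - window + 1, window). No data is copied here.
    x = sliding_window_view(inputs, window)
    # The loop in the question stops at i = len(inputs) - 1, so it never
    # produces the final window. Drop it to get the same rows.
    x = x[:-1]
    # The target for row k is the middle of its window: targets[k + window // 2].
    y = targets[window // 2 : window // 2 + len(x)]
    return x, y


# Made-up sample data, only to show the shapes.
input_train = np.arange(10, dtype=np.float32)
target_train = np.arange(10, dtype=np.float32) * 10

x_train, y_train = make_windows(input_train, target_train, window=4)
print(x_train.shape, y_train.shape)
```

Whether this is enough depends on what happens next: anything that forces a copy of x_train (for example calling .tolist() or np.array(x_train)) materializes every window again, so it is probably best to slice batches out of the view as you need them.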