I am building a dataset for a sequence-to-point convolutional network, where each window is shifted by one timestep. This loop does it:
x_train = []
y_train = []
for i in range(window,len(input_train)):
    x_train.append(input_train[i-window:i].tolist())
    y = target_train[i-window:i]
    y = y[int(len(y)/2)]
    y_train.append(y)
When I use a large value for window, e.g. 500, I get a memory error. Is there a way to build the training dataset more efficiently?
CodePudding user response:
You should use pandas. It might still take too much space, but you can try:
import pandas as pd
# if input_train isn't a pd.Series already
input_train = pd.Series(input_train)
rolling_data = (w.reset_index(drop=True) for w in input_train.rolling(window))
x_train = pd.DataFrame(rolling_data).iloc[window - 1:]
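# label = target at the midpoint of each window, one label per row of x_train
y_train = target_train[window // 2 : window // 2 + len(x_train)]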
Some explanations with an example:
Assuming a simple series:
>>> input_train = pd.Series([1, 2, 3, 4, 5])
>>> input_train
0    1
1    2
2    3
3    4
4    5
dtype: int64
We can create a dataframe with the windowed data like so:
>>> pd.DataFrame(input_train.rolling(2))
     0    1    2    3    4
0  1.0  NaN  NaN  NaN  NaN
1  1.0  2.0  NaN  NaN  NaN
2  NaN  2.0  3.0  NaN  NaN
3  NaN  NaN  3.0  4.0  NaN
4  NaN  NaN  NaN  4.0  5.0
The problem is that each window keeps its original index labels (the value at position 0 has label 0, the value at position 1 has label 1, and so on), so the values end up in the columns matching those labels. We can fix this by resetting the index of each window:
>>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2))
     0    1
0  1.0  NaN
1  1.0  2.0
2  2.0  3.0
3  3.0  4.0
4  4.0  5.0
The only thing left to do is remove the first window - 1 rows, because those windows are incomplete (that is just how rolling works):
>>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2)).iloc[2-1:] # .iloc[1:]
     0    1
1  1.0  2.0
2  2.0  3.0
3  3.0  4.0
4  4.0  5.0
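If the DataFrame still does not fit in memory, one alternative worth trying is NumPy's `sliding_window_view` (available from NumPy 1.20), which returns a read-only view of the windows instead of copying them. The following is only a sketch: it assumes `input_train` and `target_train` are 1-D and of equal length, and `make_windows` is a hypothetical helper name, not part of any library.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(inputs, targets, window):
    """Hypothetical helper: all full windows of `inputs` plus the
    target at the midpoint of each window (assumes 1-D inputs)."""
    inputs = np.asarray(inputs)
    targets = np.asarray(targets)
    # Read-only view of shape (len(inputs) - window + 1, window);
    # the underlying data is not copied.
    x = sliding_window_view(inputs, window)
    # Window k covers inputs[k:k + window], so its midpoint is
    # index k + window // 2 (same midpoint rule as the question's loop).
    y = targets[window // 2 : window // 2 + len(x)]
    return x, y


# Toy data mirroring the example series above; targets are made up.
x_train, y_train = make_windows(np.arange(1, 6), np.arange(10, 60, 10), window=2)
print(x_train)
print(y_train)
```

Note that this keeps the final window, which the loop in the question skips; use `x_train[:-1]` and `y_train[:-1]` to reproduce the loop exactly. The memory saving only holds while the result stays a view: anything that copies it into a contiguous array (for example `.tolist()` or `np.ascontiguousarray`) materializes every window again.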