MWE

I have a dataset with a bit more than 1 million rows, containing several hundred TimeSeries. Here is a simplified MWE of this data:

import pandas as pd

df = pd.DataFrame({"dtime":["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-01", "2022-01-02", "2022-01-03",
                           "2022-01-01", "2022-01-02", "2022-01-03","2022-01-01", "2022-01-02", "2022-01-03"],
                   "Type":["A","A","A","B","B","B","C","C","C","D","D","D"],
                   "Value":[1,2,3,4,6,8,1,5,8,3,1,2]})

+----+------------+--------+---------+
|    | dtime      | Type   |   Value |
|----+------------+--------+---------|
|  0 | 2022-01-01 | A      |       1 |
|  1 | 2022-01-02 | A      |       2 |
|  2 | 2022-01-03 | A      |       3 |
|  3 | 2022-01-01 | B      |       4 |
|  4 | 2022-01-02 | B      |       6 |
|  5 | 2022-01-03 | B      |       8 |
|  6 | 2022-01-01 | C      |       1 |
|  7 | 2022-01-02 | C      |       5 |
|  8 | 2022-01-03 | C      |       8 |
|  9 | 2022-01-01 | D      |       3 |
| 10 | 2022-01-02 | D      |       1 |
| 11 | 2022-01-03 | D      |       2 |
+----+------------+--------+---------+

Type represents the TimeSeries group, so A is one TimeSeries, B another, and so on.

Goal

I want to train a multi-dimensional TimeSeries NN (like the ones provided by the unit8 darts package).

from darts.models import NBEATSModel
model = NBEATSModel(input_chunk_length=50, output_chunk_length=50, n_epochs=25)
model.fit([train1, train2, train3, train4])

For this I need the data separated by Type, converted into TimeSeries format, and finally split into train/test, like this:

from darts import TimeSeries

split_date = "2022-01-02"

series1 = TimeSeries.from_dataframe(df[df["Type"] == "A"], "dtime", "Value", freq="D", fillna_value=0)
series2 = TimeSeries.from_dataframe(df[df["Type"] == "B"], "dtime", "Value", freq="D", fillna_value=0)
series3 = TimeSeries.from_dataframe(df[df["Type"] == "C"], "dtime", "Value", freq="D", fillna_value=0)
series4 = TimeSeries.from_dataframe(df[df["Type"] == "D"], "dtime", "Value", freq="D", fillna_value=0)
train1, val1 = series1.split_before(pd.Timestamp(split_date))
train2, val2 = series2.split_before(pd.Timestamp(split_date))
train3, val3 = series3.split_before(pd.Timestamp(split_date))
train4, val4 = series4.split_before(pd.Timestamp(split_date))

But the real-world data has far more than 4 Type values, so doing this procedure manually would be impractical. I'm looking for a solution with a loop or a function.

In addition to the individual series, train and test TimeSeries, I want to create a list containing each trainX, like:

ts_list = [train1, train2, train3, train4]

Does somebody have an idea how I can do this? I'm happy for any proposal.

CodePudding user response：

Did you try using a groupby over the Type column in a loop:

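train_list = []
val_list = []
# groupby yields (group name, sub-DataFrame) pairs, one per Type.
# The loop variable is not called "type" to avoid shadowing the builtin.
for type_name, group in df.groupby("Type"):
    series = TimeSeries.from_dataframe(group, "dtime", "Value", freq="D", fillna_value=0)
    train, val = series.split_before(pd.Timestamp(split_date))
    train_list.append(train)
    val_list.append(val)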

However, with a lot of data, looping over groups in pandas can become computationally expensive, so a better solution may exist (for example, using other tools such as Spark).
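If the per-group cost is a concern, part of the work can be done in plain pandas before any TimeSeries is built. The following is only a minimal sketch under some assumptions: split_groups is a hypothetical helper name, the dates are parsed once up front, and the split date is put into the validation part (which is how I understand split_before to behave). The conversion to TimeSeries is left out so the snippet runs with pandas alone.

```python
import pandas as pd

# Same toy data as in the question.
df = pd.DataFrame({
    "dtime": ["2022-01-01", "2022-01-02", "2022-01-03"] * 4,
    "Type": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "Value": [1, 2, 3, 4, 6, 8, 1, 5, 8, 3, 1, 2],
})
df["dtime"] = pd.to_datetime(df["dtime"])  # parse once, not once per group


def split_groups(frame, split_date, group_col="Type", time_col="dtime"):
    """Hypothetical helper: return ({group: train_frame}, {group: val_frame})."""
    split_ts = pd.Timestamp(split_date)
    train, val = {}, {}
    for name, group in frame.groupby(group_col, sort=True):
        group = group.sort_values(time_col)
        before = group[time_col] < split_ts  # rows strictly before the split date
        train[name] = group[before]
        val[name] = group[~before]
    return train, val


train_frames, val_frames = split_groups(df, "2022-01-02")
train_list = list(train_frames.values())  # one training frame per Type, in key order
```

Keeping the results in dictionaries keyed by Type makes it easy to look up which series belongs to which group later; each frame can then be passed to TimeSeries.from_dataframe in the same way as in the loop above.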

CodePudding user response：

The answer given above looks good. I would add that before implementing this, you should probably ask yourself whether:

You want to model your data using one time series per group. In this case, the proposed option of looping over groups looks good. You should probably use this representation if the groups represent distinct "observations" of the same underlying phenomenon (e.g., heart rate series of two distinct patients).
You want to model your data using one time series for all groups, where each group makes up one dimension of this (multivariate) time series. You should use this when each group represents a distinct dimension of a single observation (e.g., heart rate and blood pressure of a single patient). In this latter case, you should transform the dataframe to have the groups in separate columns, and call TimeSeries.from_dataframe() only once.
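For the second case, the reshaping step could look roughly like this. It is a sketch using the toy data from the question and assumes every (dtime, Type) pair occurs at most once; the variable name wide is only illustrative.

```python
import pandas as pd

# Same toy data as in the question.
df = pd.DataFrame({
    "dtime": ["2022-01-01", "2022-01-02", "2022-01-03"] * 4,
    "Type": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "Value": [1, 2, 3, 4, 6, 8, 1, 5, 8, 3, 1, 2],
})
df["dtime"] = pd.to_datetime(df["dtime"])

# Long -> wide: one row per date, one column per Type.
# pivot raises an error if a (dtime, Type) pair is duplicated.
wide = df.pivot(index="dtime", columns="Type", values="Value")

# Make the daily frequency explicit and fill any missing dates or values with 0,
# mirroring freq="D" and fillna_value=0 from the question.
wide = wide.asfreq("D").fillna(0)

print(wide)
```

As far as I understand the darts API, the resulting frame can then be handed to TimeSeries.from_dataframe(wide) in a single call, with the datetime index used as the time axis and each Type column becoming one component of the multivariate series.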