How to split time series data for sklearn classification correctly?

Time:11-26

I have a dataset (with price data of BTC), I'm trying to predict if the price will go up in the next minute or not (classification).

dataset

How exactly should I split this dataset? When I split it randomly into a train and test set, I get an accuracy of 74%.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

When I disable shuffling, I get a much worse accuracy (49%).

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)

Why is this the case, and which method should I use?
Or is there a better way to split a time series dataset into a train and test set?

CodePudding user response:

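The correct way to split it is to keep it time-ordered, so that the model is trained only on data from before the test period. A shuffled split places rows from the same time window in both the train and test sets, so the model has effectively seen the surroundings of each test row during training; that is the likely reason the 74% figure looks so much better. Around 50% accuracy seems reasonable for this type of data (i.e. roughly a 50% chance of being right and a 50% chance of being wrong, no better than a coin flip).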
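
As a rough illustration of a time-ordered split, here is a minimal sketch. The arrays `X` and `y` below are synthetic placeholders and not the BTC data from the question; the 70/30 cut mirrors the `test_size=0.3` used above. It shows a single chronological hold-out and, as one possible alternative, walk-forward folds from scikit-learn's `TimeSeriesSplit`, where every training fold ends before its test fold begins.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for the real data (assumption): 100 rows already
# sorted by time, oldest first.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

# Single chronological hold-out: oldest 70% for training, newest 30% for testing.
split_at = len(X) * 7 // 10
X_train, X_test = X[:split_at], X[split_at:]
y_train, y_test = y[:split_at], y[split_at:]

# Walk-forward cross-validation: each fold trains on the past only
# and tests on the block of rows that follows it.
tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    print(f"train rows 0..{train_idx.max()}, "
          f"test rows {test_idx.min()}..{test_idx.max()}")
```

Averaging the score over the walk-forward folds gives a less noisy estimate than a single hold-out, while still never training on rows that come after the test rows.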
