I have a dataset of BTC price data, and I'm trying to predict whether the price will go up in the next minute (binary classification).
How exactly should I split this dataset? When I split it randomly into a train and test set, I get an accuracy of 74%:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
When I disable shuffling (shuffle=False), I get a much worse accuracy (49%):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
Why is this the case, and which method should I use?
Or is there a better way to split a time series dataset into a train and test set?
CodePudding user response:
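The correct way to split it is to keep it in time order (shuffle=False), so the model is trained only on data that precedes the test period. With a random split, test rows sit between training rows from neighboring minutes, so information about the future leaks into training and the 74% figure is likely inflated. An accuracy of about 50% seems reasonable on this type of data (i.e. you have a 50% chance of being right and a 50% chance of being wrong, no better than guessing).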
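If a single time-ordered split feels too coarse, one option is walk-forward validation with scikit-learn's TimeSeriesSplit, which always tests on rows that come after the training rows. The sketch below is only an illustration: the synthetic X and y stand in for the real BTC features and labels, and n_splits=4 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for the real data (assumption): 100 time-ordered rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)  # 1 = price went up in the next minute

# Each fold trains on an expanding window of past rows
# and tests on the block of rows that immediately follows it.
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train rows 0-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}")
```

Fitting the model on X[train_idx] and scoring it on X[test_idx] in each fold gives several out-of-sample accuracy estimates instead of one, none of which involve training on future data.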