I am building a scikit-learn model (RandomForestRegressor). I have trained it successfully on my data, but I am unsure how to make predictions with it.
My CSV contains 2 items per row: a year (expressed in years since 2003) and a number (the value being predicted), usually above 1,000. When I call model.predict([[20]]), I get a small decimal instead of a number in the thousands, despite a very high R² value:
R-squared: 0.9804779528842772
Prediction: [0.67932727]
I have a feeling I'm not using this method correctly, but I couldn't find much online. A user on another question of mine said that the last item in a CSV row is supposed to be the output, so I assumed that is how it works. Please forgive me if something is unclear; just comment and I will do my best to clarify, as I am new to this.
Code:
from pandas import read_csv
from sklearn import set_config
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
names = ['YEAR', 'TOTAL']
url = 'energy/energyTotal.csv'
dataset = read_csv(url, names=names)
array = dataset.values
x = array[:, 0:1]
y = array[:, 1]
y = y.astype('int')
# rfr = RandomForestRegressor(max_depth=3)
# rfr.fit(x, y)
# print(rfr.predict([[0, 1, 0, 1]]))
x = scale(x)
y = scale(y)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.10)
#Train model
set_config(print_changed_only=False)
rfr = RandomForestRegressor()
print(rfr)
# Output of print(rfr):
# RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
#                       max_depth=None, max_features='auto', max_leaf_nodes=None,
#                       max_samples=None, min_impurity_decrease=0.0,
#                       min_samples_leaf=1, min_samples_split=2,
#                       min_weight_fraction_leaf=0.0, n_estimators=100,
#                       n_jobs=None, oob_score=False, random_state=None,
#                       verbose=0, warm_start=False)
rfr.fit(xtrain, ytrain)
score = rfr.score(xtrain, ytrain)  # note: this is R² on the training data
print("R-squared:", score)
print(rfr.predict([[20]]))
The CSV:
18,28564
17,28411
16,27515
15,24586
14,26653
13,26836
12,26073
11,27055
10,26236
9,26020
8,26538
7,25800
6,26682
5,24997
4,25100
3,24651
2,12053
1,11500
CodePudding user response:
Your data has been scaled, so your predictions are not in the original range of the TOTAL variable. You can train your model without scaling the target, and the results are still quite good. I would also recommend fitting the scaler on the training set only, to avoid leaking information about the whole dataset into the test set, and keeping the fitted scaler around: you need it to transform new inputs (like 20) into the scaled space, and to reverse any scaled predictions back into the original range.
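A minimal sketch of that approach, using the same data as the question's CSV: fit a StandardScaler on the training features only, leave y in its original units, and transform new inputs with the same scaler before predicting. (The random_state values are arbitrary, just for reproducibility.)

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data as the CSV in the question: years since 2003 and totals
x = np.arange(18, 0, -1).reshape(-1, 1)
y = np.array([28564, 28411, 27515, 24586, 26653, 26836, 26073, 27055, 26236,
              26020, 26538, 25800, 26682, 24997, 25100, 24651, 12053, 11500])

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.10,
                                                random_state=0)

# Fit the scaler on the training features only, then reuse it for new inputs
scaler = StandardScaler()
xtrain_scaled = scaler.fit_transform(xtrain)

rfr = RandomForestRegressor(random_state=0)
rfr.fit(xtrain_scaled, ytrain)  # y is unscaled, so predictions keep its units

# To predict for year 20, transform it with the same fitted scaler first
pred = rfr.predict(scaler.transform([[20]]))
print(pred)  # now in the thousands, like the TOTAL column
```

If you do scale the target as well, fit a second StandardScaler on ytrain and call its inverse_transform on the model's output to get back to the original range.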