My ML model doesn't work, can you help me?


I built an ML model, but it doesn't work because of a dimension problem: the inverse_transform method expects a 2D array, while the predict method only returns a 1D array. Can you help me? Here is my code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values
Y = dataset.iloc[:, -1].values
Y = Y.reshape(len(Y),1)

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()
X = sc_X.fit_transform(X)
Y = sc_Y.fit_transform(Y)

from sklearn.svm import SVR
regressor = SVR(kernel='rbf')
regressor.fit(X, Y)

sc_Y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])))  # this is the line that raises the error

This is the ValueError: Expected 2D array, got 1D array instead: array=[0.01].

CodePudding user response:

This code should solve your problem:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values
Y = dataset.iloc[:, -1].values
Y = Y.reshape(len(Y),1)  # this step is unnecessary: in fact, you should keep Y as a 1D array of shape (-1,)

sc_X = StandardScaler()
sc_Y = StandardScaler()
X = sc_X.fit_transform(X)
Y = sc_Y.fit_transform(Y)


regressor = SVR(kernel='rbf')
regressor.fit(X, Y)

sc_Y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])).reshape(-1,1))  # note the final reshape(-1,1)

It's unusual to use labels with shape (-1, 1); the most common convention is to keep them as (-1,).
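For example, you can sanity-check the shapes right after loading the data (a minimal sketch; the printed shapes assume Position_Salaries.csv has ten rows and a single feature column, which may not match your file):

import pandas as pd

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values
Y = dataset.iloc[:, -1].values
print(X.shape)  # e.g. (10, 1) -> (n_samples, n_features)
print(Y.shape)  # e.g. (10,)   -> (n_samples,)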

As long as X.shape is (n_samples, n_features) and Y.shape is (n_samples,), your code should be the following:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values
Y = dataset.iloc[:, -1].values

sc_X = StandardScaler()
sc_Y = StandardScaler()
X = sc_X.fit_transform(X)
Y = sc_Y.fit_transform(Y.reshape(-1, 1)).ravel()  # StandardScaler needs 2D input; ravel() brings Y back to shape (-1,)


regressor = SVR(kernel='rbf')
regressor.fit(X, Y)

sc_Y.inverse_transform(
    regressor.predict(
        sc_X.transform([[6.5]])
    ).reshape(-1, 1)  # inverse_transform also expects a 2D array
)
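If you want the prediction as a plain number rather than a (1, 1) array, you can index into the result of the last call above (the y_pred name is just for illustration):

y_pred = sc_Y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])).reshape(-1, 1))
print(y_pred[0, 0])  # the predicted salary back in the original units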

CodePudding user response:

I can tell you from experience that this kind of thing will drive you nuts. So I recommend not doing this, but instead using sklearn.compose.TransformedTargetRegressor. It will save you having to inverse transform the target, which is much more convenient, and therefore safer (because it's harder to mess it up!). You can also easily pass in custom transformers if you want.
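For instance, TransformedTargetRegressor accepts either a transformer object or a pair of functions via func and inverse_func. A minimal sketch of the latter (the log transform here is purely illustrative, not something from the question):

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.svm import SVR

# log-transform the target before fitting; exponentiate predictions on the way out
ttr_log = TransformedTargetRegressor(regressor=SVR(kernel='rbf'),
                                     func=np.log1p,
                                     inverse_func=np.expm1)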

I'd go further and say you should use a Pipeline to manage scaling in general. It's nice because the scaler becomes part of the pipeline, so you can forget about having to scale the data when you predict later. If you choose to estimate model performance with holdout or cross-validation splits later, you also don't have to remember to fit the scaler to the training data only.

Here's how I'd do it (I'm using a small y for the target, in accordance with convention):

import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

df = pd.read_csv('Position_Salaries.csv')
X = df.iloc[:, 1:-1].values
y = df.iloc[:, -1].values

ttr = TransformedTargetRegressor(regressor=SVR(kernel='rbf'),
                                 transformer=StandardScaler(),
                                 )
pipe = make_pipeline(StandardScaler(), ttr).fit(X, y)

As you can see, it's a lot less code, and the beauty is that you can pass new data directly to this pipe. The pipeline takes care of scaling the input data X, and the TransformedTargetRegressor takes care of scaling and inverse-transforming the target y. So this gives you a prediction in the correct units:

pipe.predict([[6.5]])  # Assuming there's only one feature.

Some notes, in case you want to develop this further:

  • You may not actually need to scale the target. Try not scaling it and see if it makes a difference; I suspect it won't. It usually just makes handling the results really fiddly (e.g. calculating errors), and often is not necessary. You do need to scale X with a nonlinear SVR though.
  • If you split the data later, e.g. into training and validation sets, then remember to fit the scaler to the training data only, otherwise you'll leak information about the validation data distribution to the model.
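Following up on that last point: because all of the scaling lives inside the pipeline, you can hand the whole thing to cross-validation and each fold will fit the scalers on its own training portion only. A minimal sketch (the cv value and scoring metric are illustrative choices, and with a dataset as small as Position_Salaries.csv the scores won't mean much):

from sklearn.model_selection import cross_val_score

# each fold refits the pipeline (scaler, target transformer and SVR) on that fold's training data only
scores = cross_val_score(pipe, X, y, cv=3, scoring='neg_mean_absolute_error')
print(scores)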