X has 1 features, but LinearRegression is expecting 5 features as input

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

dados = pd.read_csv("dados.csv", thousands=',', sep = ";", header = 0, encoding='latin-1')

dados.drop('pais', axis = 1, inplace=True)

df = dados.to_numpy()
g = [df[:,1]]
h = [df[:,0]]

#plt.scatter(x,y, color = 'blue')
plt.scatter(g,h, color = 'blue')

model=sklearn.linear_model.LinearRegression()
model.fit(g,h)

G_new=[[22500]]
print(model.predict(G_new))

The call to model.predict fails with:

ValueError: X has 1 features, but LinearRegression is expecting 5 features as input.

How can I solve this?

CodePudding user response：

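The model does not inherently expect 5 features; it is fine with 1 feature or 100,000. It expects 5 because that is what it saw in fit. scikit-learn wants X as a 2D array of shape (n_samples, n_features). df[:,1] is a 1D array, and wrapping it in a list, [df[:,1]], gives shape (1, n): a single sample with n features. Your file evidently has 5 rows, so the model was trained on 1 sample with 5 features, and predict then rejects G_new, which has 1 feature.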

Here's how I would define X and y (which you call g and h):

X = df[:,1].reshape(-1, 1)
y = df[:,0]

The reshape method transforms the 1D array into a 2D array with one column (a 'column vector' if you like). If you were selecting more than 1 column with a slice such as df[:,1:3], the result would already be 2D and you would not need this reshaping.

No conversion is needed here because df is already a NumPy array (you called dados.to_numpy()). I prefer NumPy for slinging sklearn data around. Pandas is great for data wrangling, but once I make X and y for the ML task, I move to NumPy. Personal preference.
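To make the shape difference concrete, here is a small sketch using made-up numbers in place of your CSV column (the variable names col, wrapped and reshaped are only for illustration):

```python
import numpy as np

# Made-up values standing in for one column of df (assumption: 5 rows)
col = np.array([10000.0, 15000.0, 20000.0, 25000.0, 30000.0])

# Wrapping the 1D array in a list, as in the question, adds a leading axis
wrapped = np.asarray([col])

# reshape(-1, 1) turns it into one column with as many rows as needed
reshaped = col.reshape(-1, 1)

print(col.shape)       # (5,)   1D
print(wrapped.shape)   # (1, 5) one sample, five features
print(reshaped.shape)  # (5, 1) five samples, one feature
```

The last shape is the one scikit-learn should receive for a single-feature regression.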

By the way, people use uppercase X to indicate that it should be a matrix, i.e. 2D. It's a mathematical convention.

CodePudding user response：

I assume g is X_train and h is y_train. Both need the right shape: g should be 2D, (n_samples, n_features), and h should be 1D, (n_samples,). Try this:

from sklearn.linear_model import LinearRegression

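# Use iloc on the DataFrame (NumPy arrays have no iloc), then convert to NumPy
g = dados.iloc[:, 1:2].to_numpy()  # slice keeps it 2D: (n_samples, 1)
h = dados.iloc[:, 0].to_numpy()    # target, 1D: (n_samples,)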

model = LinearRegression()
model.fit(g, h)
G_new = [[22500]]
print(model.predict(G_new))
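For anyone who wants to reproduce this without dados.csv, here is a rough end-to-end sketch. The data are invented (an exact line y = 2x + 1), and the names x, bad and good are just for illustration. It first fits on the list-wrapped arrays as in the question, then on correctly shaped ones:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data lying exactly on y = 2*x + 1 (assumption: 5 rows, 1 feature)
x = np.array([10000.0, 15000.0, 20000.0, 25000.0, 30000.0])
y = 2.0 * x + 1.0

# As in the question: list-wrapped arrays are read as 1 sample with 5 features
bad = LinearRegression().fit([x], [y])
print(bad.n_features_in_)  # 5
try:
    bad.predict([[22500]])
except ValueError as err:
    print(err)  # complains about the number of features

# Fixed: X is (n_samples, 1), y is (n_samples,)
good = LinearRegression().fit(x.reshape(-1, 1), y)
print(good.n_features_in_)  # 1
pred = good.predict([[22500]])
print(pred)  # close to 45001 for this invented line
```

Checking n_features_in_ right after fit is a quick way to confirm the model saw the number of features you intended.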