import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model
dados = pd.read_csv("dados.csv", thousands=',', sep = ";", header = 0, encoding='latin-1')
dados.drop('pais', axis = 1, inplace=True)
df = dados.to_numpy()
g = [df[:,1]]
h = [df[:,0]]
#plt.scatter(x,y, color = 'blue')
plt.scatter(g,h, color = 'blue')
model=sklearn.linear_model.LinearRegression()
model.fit(g,h)
G_new=[[22500]]
print(model.predict(G_new))
The last line fails with:
ValueError: X has 1 features, but LinearRegression is expecting 5 features as input.
How can I solve this?
CodePudding user response:
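LinearRegression has no built-in expectation of 5 features; it is fine with 1 feature or 100,000 features. It does need X to be a 2D array of shape (n_samples, n_features), though. Wrapping the column in a list, as in [df[:,1]], gives a single row of shape (1, n), so the model sees one sample with n features. That is presumably where the 5 comes from: your data appears to have 5 rows.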
Here's how I would define X and y (which you call g and h):
X = df[:, 1].reshape(-1, 1)
y = df[:, 0]
The reshape method transforms the 1D array into a 2D array (a 'column vector' if you like); if you were selecting more than 1 column you would not need this reshaping.
Your df is already a NumPy array thanks to dados.to_numpy(), so there is nothing to cast. That suits me, because I prefer NumPy for slinging sklearn data around. Pandas is great for data wrangling, but once I make X and y for the ML task, I move to NumPy. Personal preference.
By the way, people use uppercase X to indicate that it should be a matrix, i.e. 2D. It's a mathematical convention.
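To see what the shapes look like in practice, here is a small sketch. The 5x2 array is made up and only stands in for df; the names wrapped and column are mine, not from your code.

```python
import numpy as np

# Hypothetical stand-in for df = dados.to_numpy(): 5 rows, 2 columns.
# The values are invented for illustration only.
df = np.array([
    [5.0, 10000.0],
    [6.0, 20000.0],
    [6.5, 30000.0],
    [7.0, 40000.0],
    [7.5, 50000.0],
])

# Wrapping the 1D column in a list yields a single row.
wrapped = np.asarray([df[:, 1]])

# reshape(-1, 1) yields one row per sample and a single feature column.
column = df[:, 1].reshape(-1, 1)

print(wrapped.shape)  # (1, 5): 1 sample, 5 features
print(column.shape)   # (5, 1): 5 samples, 1 feature
```

Printing .shape on whatever you pass to fit is a quick way to check that rows are samples and columns are features.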
CodePudding user response:
I assume g is X_train and h is y_train. Both need the right shape: g should be 2D, with shape (n_samples, n_features), and h should be 1D. Try this:
from sklearn.linear_model import LinearRegression
df = dados.to_numpy()
g = df[:, 1:2]  # slicing with 1:2 keeps the result 2D: (n_samples, 1)
h = df[:, 0]    # 1D target
model = LinearRegression()
model.fit(g, h)
G_new = [[22500]]
print(model.predict(G_new))
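For a version you can run without dados.csv, here is a minimal end-to-end sketch. It assumes column 0 is the target and column 1 is the only feature, as in the question; the numbers are invented and chosen to lie exactly on a line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data standing in for dados.to_numpy():
# column 0 is the target, column 1 is the single feature.
df = np.array([
    [4.0, 10000.0],
    [5.0, 20000.0],
    [6.0, 30000.0],
    [7.0, 40000.0],
    [8.0, 50000.0],
])

X = df[:, 1].reshape(-1, 1)  # shape (5, 1): 5 samples, 1 feature
y = df[:, 0]                 # shape (5,): one target per sample

model = LinearRegression()
model.fit(X, y)

G_new = [[22500]]            # one new sample with one feature
prediction = model.predict(G_new)
print(prediction)
```

Because the invented points follow y = 3 + x / 10000 exactly, the prediction for 22500 should come out very close to 5.25. With real data you would of course get whatever the fitted line gives.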