How to construct the .fit data parameter


I have the following dataset: https://raw.githubusercontent.com/Joffreybvn/real-estate-data-analysis/master/data/clean/belgium_real_estate.csv

I want to predict the price column based on the other features: basically, I want to predict the house price from square meters, number of rooms, postal code, etc.

So I did the following:

Load data:

from azureml.core import Workspace, Dataset

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name='BelgiumRealEstate')
data = dataset.to_pandas_dataframe()

data.sample(5)


Column1 postal_code city_name   type_of_property    price   number_of_rooms house_area  fully_equipped_kitchen  open_fire   terrace garden  surface_of_the_land number_of_facades   swimming_pool   state_of_the_building   lattitude   longitude   province    region
33580   33580   9850    Landegem    1   380000  3   127 1   0   1   0   0   0   0   as new  3.588809    51.054637   Flandre-Orientale   Flandre
11576   11576   9000    Gent    1   319000  2   89  1   0   1   0   0   2   0   as new  3.714155    51.039713   Flandre-Orientale   Flandre
12830   12830   3300    Bost    0   170000  3   140 1   0   1   1   160 2   0   to renovate 4.933924    50.784632   Brabant flamand Flandre
20736   20736   6880    Cugnon  0   270000  4   218 0   0   0   0   3000    4   0   unknown 5.203308    49.802043   Luxembourg  Wallonie
11416   11416   9000    Gent    0   875000  6   232 1   0   0   1   0   2   0   good    3.714155    51.039713   Flandre-Orientale   Flandre

I one-hot encoded the categorical features: city, province, region, and state of the building:

import pandas as pd

one_hot_state_of_the_building = pd.get_dummies(data.state_of_the_building)
one_hot_city = pd.get_dummies(data.city_name, prefix='city')
one_hot_province = pd.get_dummies(data.province, prefix='province')
one_hot_region = pd.get_dummies(data.region, prefix='region')

Then I dropped the original categorical columns and added the encoded ones to the pandas DataFrame:

# Removing the original categorical features
data.drop(['city_name','state_of_the_building','province','region'], axis=1, inplace=True)

# Merging the one-hot encoded features with our dataset 'data'
data = pd.concat([data, one_hot_city, one_hot_state_of_the_building, one_hot_province, one_hot_region], axis=1)

Then I separated the price (the target) from the features:

x = data.drop('price', axis=1)
y = data.price

then train/test split:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

then I train:

import lightgbm as lgb
from sklearn.metrics import mean_squared_error

x_df = pd.DataFrame(x, columns=x.columns)
x_train, x_test, y_train, y_test = train_test_split(x_df, y, test_size=0.15)

# Converting the data into the LightGBM Dataset format
d_train = lgb.Dataset(x_train, label=y_train)

# Declaring the parameters
params = {
    'task': 'train',
    'boosting': 'gbdt',
    'objective': 'regression',
    'num_leaves': 10,
    'learning_rate': 0.05,
    'metric': {'l2', 'l1'},
    'verbose': -1
}
# Model creation and training
clf = lgb.train(params, d_train, 10000)
# Model prediction on x_test
y_pred = clf.predict(x_test)
# MSE error metric (take the square root for RMSE)
mean_squared_error(y_test, y_pred)

Then I can predict on random rows:

row = x_test.sample(n=1, replace=False)
price = clf.predict(row)
price

247000

Now I want to use new data rather than rows from my test dataset: I want to be able to manually pass the parameters (square meters, number of rooms, etc.). Please note that after I one-hot encoded the categorical columns, the pandas DataFrame has over 1000 features:

data.shape
(40395, 1082)

So if I understand correctly, I can pass an array to:

clf.fit ([Id,Column1, postal_code, type_of_property, price, number_of_rooms,house_area,fully_equipped_kitchen,open_fire,terrace,garden...

The problem is that there are over 1000 features, so I don't know how to construct this data parameter correctly.

And if I try this:

clf.predict([1, 1050, 1, 0, 2, 100, 1, 0, 1, 0])

Then I get this error:

ValueError: Input numpy.ndarray or list must be 2 dimensional

CodePudding user response:

If you want to predict, the dataset you predict on must have the same dimensions as your training set, and you need to process it with the same steps (in this case, creating the dummy columns). So...

First, I would load the dataset just like this:

import pandas as pd

URL = 'https://raw.githubusercontent.com/Joffreybvn/real-estate-data-analysis/master/data/clean/belgium_real_estate.csv'
df = pd.read_csv(URL)

You can drop the duplicate index and the latitude/longitude columns. Since the location is represented in several columns, you can also drop most of them and keep just one representative column, like province (note that state_of_the_building and province are kept for now, because they get encoded in the next step):

df.drop(['Unnamed: 0', 'lattitude', 'longitude', 'city_name', 'region'], axis=1, inplace=True)

At this point, your df has 15 columns: the 13 you will keep, plus state_of_the_building and province, which still need to be encoded. Now, it is good practice to use OneHotEncoder from sklearn; get_dummies has a few problems, most importantly that the columns it produces depend on the categories present in whatever data you pass it, so new data can end up with different columns than the training set. So, use OneHotEncoder on just state_of_the_building and province:

from sklearn.preprocessing import OneHotEncoder

oneHotEncoder = OneHotEncoder(handle_unknown='ignore')
columns_onehot = oneHotEncoder.fit_transform(df[['state_of_the_building', 'province']]).toarray()
columns_onehot = pd.DataFrame(columns_onehot)
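Optionally, assuming scikit-learn 1.0 or newer (where OneHotEncoder has get_feature_names_out), you can give the encoded columns readable names instead of plain integers:

# Name the encoded columns after their original column and category
columns_onehot.columns = oneHotEncoder.get_feature_names_out(['state_of_the_building', 'province'])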

Now you have your data ready for analysis:

df.drop(['state_of_the_building', 'province'], axis=1, inplace=True)
data = pd.concat([df, columns_onehot], axis=1)

From this point I have just 3 comments:

  1. Keep in mind that X is written as a capital letter by convention, because it is a matrix.
  2. Depending on the method you use, you should scale your data, which you are not doing (see the sketch after the split below).
  3. You are using train_test_split twice, reassigning the sets, so choose one.

So I use your code from this point on, using 0.3 for the split:

X = data.drop('price', axis=1)
y = data.price

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
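To illustrate point 2 above, here is a minimal scaling sketch. Tree-based models such as GradientBoostingRegressor do not strictly need it, but linear or distance-based models do; in any case, the scaler must be fitted on the training set only:

from sklearn.preprocessing import StandardScaler

# Fit on the training features only, then apply to both sets,
# so no information from the test set leaks into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)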

Then, I build the model:

from sklearn.ensemble import GradientBoostingRegressor
reg = GradientBoostingRegressor()
reg.fit(X_train, y_train)

y_predict = reg.predict(X_test)
reg.score(X_test, y_test)
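reg.score returns the R² on the test set. If you want a number comparable to the metric used in the question, you can also compute the RMSE:

import numpy as np
from sklearn.metrics import mean_squared_error

# Root mean squared error on the held-out test set
rmse = np.sqrt(mean_squared_error(y_test, y_predict))
print(rmse)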

So, if you want predictions for new data, the new data needs the same processing we did for cleaning: drop the same columns, then one-hot encode with the same fitted encoder. You then use the same predict method to get values from your model.

y_predict_new_data = reg.predict(X_new_data_after_cleaning)
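
To make that concrete, here is a minimal sketch of how you could construct X_new_data_after_cleaning by hand. The raw values below are hypothetical; the important parts are reusing the oneHotEncoder fitted above and aligning the columns with X_train before predicting:

# A hypothetical new listing, using the same raw columns as the cleaned training data
new_row = pd.DataFrame([{
    'postal_code': 9000,
    'type_of_property': 1,
    'number_of_rooms': 3,
    'house_area': 120,
    'fully_equipped_kitchen': 1,
    'open_fire': 0,
    'terrace': 1,
    'garden': 0,
    'surface_of_the_land': 0,
    'number_of_facades': 2,
    'swimming_pool': 0,
    'state_of_the_building': 'as new',
    'province': 'Flandre-Orientale',
}])

# Encode the categoricals with the SAME encoder fitted on the training data;
# handle_unknown='ignore' turns an unseen category into an all-zero row
encoded = pd.DataFrame(
    oneHotEncoder.transform(new_row[['state_of_the_building', 'province']]).toarray()
)
# (if you renamed the encoded training columns earlier, rename these the same way)

X_new_data_after_cleaning = pd.concat(
    [new_row.drop(['state_of_the_building', 'province'], axis=1), encoded], axis=1
)

# Align with the training matrix: same columns, same order; anything the new row
# does not have (for example a leftover index column) is filled with 0
X_new_data_after_cleaning = X_new_data_after_cleaning.reindex(columns=X_train.columns, fill_value=0)

y_predict_new_data = reg.predict(X_new_data_after_cleaning)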