How to improve Regression RMSE with LightGBM

I have the following dataset: https://raw.githubusercontent.com/Joffreybvn/real-estate-data-analysis/master/data/clean/belgium_real_estate.csv

I want to predict the price column from the other features. In other words, I want to predict the house price from the area in square meters, number of rooms, postal code, etc.

So I did the following:

Load data:

from azureml.core import Workspace, Dataset

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name='BelgiumRealEstate')
data = dataset.to_pandas_dataframe()

data.sample(5)


Column1 postal_code city_name   type_of_property    price   number_of_rooms house_area  fully_equipped_kitchen  open_fire   terrace garden  surface_of_the_land number_of_facades   swimming_pool   state_of_the_building   lattitude   longitude   province    region
33580   33580   9850    Landegem    1   380000  3   127 1   0   1   0   0   0   0   as new  3.588809    51.054637   Flandre-Orientale   Flandre
11576   11576   9000    Gent    1   319000  2   89  1   0   1   0   0   2   0   as new  3.714155    51.039713   Flandre-Orientale   Flandre
12830   12830   3300    Bost    0   170000  3   140 1   0   1   1   160 2   0   to renovate 4.933924    50.784632   Brabant flamand Flandre
20736   20736   6880    Cugnon  0   270000  4   218 0   0   0   0   3000    4   0   unknown 5.203308    49.802043   Luxembourg  Wallonie
11416   11416   9000    Gent    0   875000  6   232 1   0   0   1   0   2   0   good    3.714155    51.039713   Flandre-Orientale   Flandre

I one-hot encoded the categorical features (city, province, region, and state of the building):

import pandas as pd

one_hot_state_of_the_building = pd.get_dummies(data.state_of_the_building)
one_hot_city = pd.get_dummies(data.city_name, prefix='city')
one_hot_province = pd.get_dummies(data.province, prefix='province')
one_hot_region = pd.get_dummies(data.region, prefix='region')

Then I added those columns to the pandas dataframe:

# Remove the original categorical columns
data.drop(['city_name', 'state_of_the_building', 'province', 'region'], axis=1, inplace=True)

# Merge the one-hot encoded features into the dataset 'data'
data = pd.concat([data, one_hot_city, one_hot_state_of_the_building, one_hot_province, one_hot_region], axis=1)

I separate the price (the target) from the features:

x = data.drop('price', axis=1)
y = data.price

Then I do a train/test split:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.3)

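Then I train:

import lightgbm as lgb

# x is already a DataFrame; data.columns still contains 'price', so use x.columns
x_df = pd.DataFrame(x, columns=x.columns)
# Note: this second split overrides the 70/30 split above
x_train, x_test, y_train, y_test = train_test_split(x_df, y, test_size=0.15)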

#Converting the data into proper LGB Dataset Format
d_train=lgb.Dataset(x_train, label=y_train)


#Declaring the parameters
params = {
    'task': 'train', 
    'boosting': 'gbdt',
    'objective': 'regression',
    'num_leaves': 10,
    'learning_rate': 0.05,
    'metric': {'l2','l1'},
    'verbose': -1
}
#model creation and training
clf=lgb.train(params,d_train,10000)
#model prediction on X_test
y_pred=clf.predict(x_test)
#using RMSE error metric
mean_squared_error(y_pred,y_test)

However, the RMSE is 6053845952.2186775, which seems like a huge number.

I am not sure what I am doing wrong here.

CodePudding user response：

I assume you are using sklearn.metrics.mean_squared_error, which returns the MSE, not its square root. Taking the root gives 6053845952 ** 0.5 ≈ 77806, which seems to me to be a reasonable RMSE for the quoted prices (e.g. it corresponds to less than 10% off for a price of 875000).
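
To make the difference concrete, here is a minimal sketch that computes MSE and RMSE by hand with NumPy and redoes the arithmetic for the number reported in the question. The helper names mse and rmse and the toy prices are made up for illustration; they are not part of the original code.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE, in price units."""
    return float(np.sqrt(mse(y_true, y_pred)))


# Toy example (hypothetical prices, not taken from the dataset)
toy_true = [100000, 200000]
toy_pred = [110000, 180000]
toy_mse = mse(toy_true, toy_pred)    # squared price units
toy_rmse = rmse(toy_true, toy_pred)  # same units as the price

# The value from the question is an MSE; take the root to get the RMSE
reported_mse = 6053845952.2186775
reported_rmse = float(np.sqrt(reported_mse))

# Relative size of that error for the most expensive sampled house
relative_error = reported_rmse / 875000

print(f"toy MSE: {toy_mse:.0f}, toy RMSE: {toy_rmse:.1f}")
print(f"reported RMSE: {reported_rmse:.0f}")
print(f"relative to 875000: {relative_error:.1%}")
```

With scikit-learn, the equivalent for the question's variables would be np.sqrt(mean_squared_error(y_test, y_pred)), assuming mean_squared_error is indeed the one from sklearn.metrics.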