I have the following dataset: https://raw.githubusercontent.com/Joffreybvn/real-estate-data-analysis/master/data/clean/belgium_real_estate.csv
I want to predict the price column from the other features; basically, I want to predict house price based on square meters, number of rooms, postal code, etc.
So I did the following:
Load data:
from azureml.core import Workspace, Dataset
import pandas as pd

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='BelgiumRealEstate')
data = dataset.to_pandas_dataframe()
data.sample(5)
Column1 postal_code city_name type_of_property price number_of_rooms house_area fully_equipped_kitchen open_fire terrace garden surface_of_the_land number_of_facades swimming_pool state_of_the_building lattitude longitude province region
33580 33580 9850 Landegem 1 380000 3 127 1 0 1 0 0 0 0 as new 3.588809 51.054637 Flandre-Orientale Flandre
11576 11576 9000 Gent 1 319000 2 89 1 0 1 0 0 2 0 as new 3.714155 51.039713 Flandre-Orientale Flandre
12830 12830 3300 Bost 0 170000 3 140 1 0 1 1 160 2 0 to renovate 4.933924 50.784632 Brabant flamand Flandre
20736 20736 6880 Cugnon 0 270000 4 218 0 0 0 0 3000 4 0 unknown 5.203308 49.802043 Luxembourg Wallonie
11416 11416 9000 Gent 0 875000 6 232 1 0 0 1 0 2 0 good 3.714155 51.039713 Flandre-Orientale Flandre
I one-hot encoded the categorical features (city, province, region, state of the building):
one_hot_state_of_the_building = pd.get_dummies(data.state_of_the_building)
one_hot_city = pd.get_dummies(data.city_name, prefix='city')
one_hot_province = pd.get_dummies(data.province, prefix='province')
one_hot_region = pd.get_dummies(data.region, prefix='region')
Then I added those columns to the pandas DataFrame:
# removing the categorical features
data.drop(['city_name', 'state_of_the_building', 'province', 'region'], axis=1, inplace=True)
# merging the one-hot encoded features back into 'data'
data = pd.concat([data, one_hot_city, one_hot_state_of_the_building, one_hot_province, one_hot_region], axis=1)
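As an aside, the whole encode/drop/concat sequence above could be replaced by a single call (the resulting column names differ slightly, since get_dummies prefixes them with the source column name):
data = pd.get_dummies(data, columns=['city_name', 'state_of_the_building', 'province', 'region'])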
I remove the price column to separate the features from the target:
x = data.drop('price', axis=1)
y = data.price
Then train/test split:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
Then I train:
x_df = pd.DataFrame(x, columns=x.columns)
x_train, x_test, y_train, y_test = train_test_split(x_df, y, test_size=0.15)
# converting the data into LightGBM's Dataset format
import lightgbm as lgb
d_train = lgb.Dataset(x_train, label=y_train)
# declaring the parameters
params = {
    'task': 'train',
    'boosting': 'gbdt',
    'objective': 'regression',
    'num_leaves': 10,
    'learning_rate': 0.05,
    'metric': {'l2', 'l1'},
    'verbose': -1
}
# model creation and training
clf = lgb.train(params, d_train, 10000)
# model prediction on x_test
y_pred = clf.predict(x_test)
# RMSE error metric
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred) ** 0.5
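A fixed 10,000 boosting rounds with no validation set will tend to overfit. A minimal early-stopping sketch, assuming lightgbm >= 3.3 where early stopping is passed as a callback:
# hold out a validation set so training stops once l1/l2 stop improving
d_valid = lgb.Dataset(x_test, label=y_test, reference=d_train)
clf = lgb.train(params, d_train, num_boost_round=10000,
                valid_sets=[d_valid],
                callbacks=[lgb.early_stopping(stopping_rounds=100)])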
Then I can predict on random rows from the test set:
row = x_test.sample(n=1, replace=False)
price = clf.predict(row)
price
247000
Now I want to use new data rather than rows from my test dataset: I want to be able to pass the parameters manually (square meters, number of rooms, etc.). Please note that after I one-hot encoded the categorical columns, the pandas DataFrame has over 1000 features:
data.shape
(40395, 1082)
So if I understand correctly, I can pass an array to:
clf.predict([Column1, postal_code, type_of_property, number_of_rooms, house_area, fully_equipped_kitchen, open_fire, terrace, garden...
The problem is that there are over 1000 features, so I don't know how to construct this data parameter correctly.
And if I try this:
clf.predict([1, 1050, 1, 0, 2, 100, 1, 0, 1, 0])
then I get this error:
ValueError: Input numpy.ndarray or list must be 2 dimensional
CodePudding user response:
If you want to predict, the dataset you predict on must have exactly the same dimensions as your training set, and it must be processed with the same steps; in this case, turned into the same dummy columns. So...
First, I would import the dataset directly:
import pandas as pd

URL = 'https://raw.githubusercontent.com/Joffreybvn/real-estate-data-analysis/master/data/clean/belgium_real_estate.csv'
df = pd.read_csv(URL)
You can drop the duplicate index and the latitude/longitude columns; since location is represented in several columns, you can keep just one representative column, such as province. Note that state_of_the_building and province are kept for now, because they still have to be encoded:
df.drop(['Unnamed: 0', 'lattitude', 'longitude', 'city_name', 'region'], axis=1, inplace=True)
At this point, your df keeps the numeric features plus the two categorical columns that still need encoding. Now, it is good practice to use OneHotEncoder from sklearn; get_dummies has a real drawback here: it cannot guarantee that new data produces the same columns, whereas a fitted encoder can be reused via transform. So, apply it to just province and state_of_the_building:
from sklearn.preprocessing import OneHotEncoder

oneHotEncoder = OneHotEncoder(handle_unknown='ignore')
columns_onehot = oneHotEncoder.fit_transform(df[['state_of_the_building', 'province']]).toarray()
columns_onehot = pd.DataFrame(columns_onehot, columns=oneHotEncoder.get_feature_names_out())  # named columns (scikit-learn >= 1.0)
Then drop the raw categorical columns and concatenate the encoded ones; now your data is ready for analysis:
df.drop(['state_of_the_building', 'province'], axis=1, inplace=True)
data = pd.concat([df, columns_onehot], axis=1)
From this point I have just 3 comments:
- Keep in mind that X is written as a capital letter because it is a matrix; this is the convention.
- Depending on the method you are using, you should scale your data, which you are not doing (see the scaling sketch after the split below).
- You are calling train_test_split twice, reassigning the sets; choose one.
So I use your code from this point, using 0.3 for the split:
X = data.drop('price', axis=1)
y = data.price
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
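On the scaling comment above, a minimal sketch using StandardScaler. It matters mainly if you switch to a linear or distance-based model; tree ensembles such as gradient boosting are largely insensitive to feature scale:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same parameters on the test set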
Then, I build the model:
from sklearn.ensemble import GradientBoostingRegressor

reg = GradientBoostingRegressor()
reg.fit(X_train, y_train)
y_predict = reg.predict(X_test)
reg.score(X_test, y_test)  # R^2 on the held-out test set
So, if you want predictions for new data, the new data needs the same processing we did for cleaning: drop the same columns, then one-hot encode with the already-fitted encoder. Then you use the same predict method to get values from your model:
y_predict_new_data = reg.predict(X_new_data_after_cleaning)
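Concretely, here is a sketch of scoring one manually specified house. The feature values below are made up for illustration; the raw column names are the ones from the CSV, and oneHotEncoder is the encoder fitted above:
new_row = pd.DataFrame([{
    'postal_code': 1050,
    'type_of_property': 1,
    'number_of_rooms': 2,
    'house_area': 100,
    'fully_equipped_kitchen': 1,
    'open_fire': 0,
    'terrace': 1,
    'garden': 0,
    'surface_of_the_land': 0,
    'number_of_facades': 2,
    'swimming_pool': 0,
    'state_of_the_building': 'good',
    'province': 'Brabant flamand',
}])
# encode with the encoder that was fitted on the training data
encoded = pd.DataFrame(
    oneHotEncoder.transform(new_row[['state_of_the_building', 'province']]).toarray(),
    columns=oneHotEncoder.get_feature_names_out(),
)
new_row = pd.concat([new_row.drop(['state_of_the_building', 'province'], axis=1), encoded], axis=1)
# align columns with the training matrix; anything missing is filled with 0
new_row = new_row.reindex(columns=X_train.columns, fill_value=0)
reg.predict(new_row)
This also explains the original "must be 2 dimensional" error: predict expects one row per sample, with exactly the columns the model was trained on, which the reindex guarantees.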