RandomForestRegressor in Julia

Time:07-20

I'm trying to train a RandomForestRegressor using DecisionTree.jl and RandomizedSearchCV (from ScikitLearn.jl) in Julia. The datasets (x_train, y_train, etc.) are available on my Google Drive, so you can test it on your machine. The code is as follows:

using CSV
using DataFrames

using ScikitLearn: fit!, predict
using ScikitLearn.GridSearch: RandomizedSearchCV
using DecisionTree

x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)

mod = RandomForestRegressor()

param_dist = Dict("n_trees"=>[50 , 100, 200, 300],
                  "max_depth"=> [3, 5, 6 ,8 , 9 ,10])

model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)

fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))

predict(model, Matrix(x_test))

This throws a MethodError like this:

ERROR: MethodError: no method matching fit!(::RandomForestRegressor, ::Matrix{Float64}, ::Matrix{Float64})
Closest candidates are:
  fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
  fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
  fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
  ...
Stacktrace:
 [1] _fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64}, parameter_iterable::Vector{Any})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:332
 [2] fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:748
 [3] top-level scope
   @ c:\Users\Shayan\Desktop\AUT\Thesis\test.jl:17

If you're curious about the shape of the data:

julia> size(x)
(1550, 70)

julia> size(y_train)
(1550, 10)

How can I solve this problem?

PS: I also tried converting both inputs to Matrix{Any}:

julia> fit!(model, Matrix{Any}(x), Matrix{Any}(DataFrames.dropmissing(y_train)))

ERROR: MethodError: no method matching fit!(::RandomForestRegressor, ::Matrix{Any}, ::Matrix{Any})
Closest candidates are:
  fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
  fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
  fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
  ...
Stacktrace:
 [1] _fit!(self::RandomizedSearchCV, X::Matrix{Any}, y::Matrix{Any}, parameter_iterable::Vector{Any})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:332
 [2] fit!(self::RandomizedSearchCV, X::Matrix{Any}, y::Matrix{Any})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:748
 [3] top-level scope
   @ c:\Users\Shayan\Desktop\AUT\Thesis\MyWork\Thesis.jl:327

CodePudding user response:

Looking at the random forest regression example in the DecisionTree.jl docs, it doesn't follow the fit!() / predict() design pattern. The error confirms that no fit!() method matches a RandomForestRegressor with these argument types. Alternatively, you might look at the RandomForest.jl package, which does follow the fit!() / predict() pattern.
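One possible reading of the MethodError, given that y_train is reported as 1550×10: the targets are passed as a Matrix, while a single-output fit method would typically be declared for a vector of targets. The sketch below uses a hypothetical toy_fit function (it is not part of DecisionTree.jl or ScikitLearn.jl) only to illustrate how Julia's dispatch produces this kind of error.

```julia
# Hypothetical stand-in for a single-output fit method (not DecisionTree.jl's real code):
# it accepts a feature matrix and a *vector* of targets.
toy_fit(X::AbstractMatrix, y::AbstractVector) = "fitted on $(length(y)) targets"

X = rand(6, 3)   # 6 samples, 3 features
Y = rand(6, 2)   # 6 samples, 2 target columns (a Matrix, like the 1550×10 y_train)

# A Matrix is not an AbstractVector, so dispatch finds no matching method.
err = try
    toy_fit(X, Y)
catch e
    e
end
println(err isa MethodError)

# Selecting a single column yields a Vector, which matches the signature.
println(toy_fit(X, Y[:, 1]))
```

If this reading applies, converting the element type (for example to Matrix{Any}) would not help, because the mismatch is in the number of dimensions of y rather than in its element type.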

CodePudding user response:

As stated here, DecisionTree.jl doesn't support multi-output random forests yet, so I gave up on DecisionTree.jl. The scikit-learn RandomForestRegressor imported through ScikitLearn.jl is adequate in my case:

using ScikitLearn: @sk_import, fit!, predict
@sk_import ensemble: RandomForestRegressor
using ScikitLearn.GridSearch: RandomizedSearchCV
using CSV
using DataFrames


x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)

# x_test holds a single sample: convert the DataFrame to a 1×n_features matrix
x_test = reshape(Matrix(x_test), 1, :)

mod = RandomForestRegressor()
param_dist = Dict("n_estimators"=>[50 , 100, 200, 300],
                  "max_depth"=> [3, 5, 6 ,8 , 9 ,10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)

fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))

predict(model, x_test)

This works fine for me, but it's very slow, much slower than Python. I'll add a benchmark comparing the two languages on the same data sets.
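For anyone who wants to stay with a single-output learner, a common workaround for the missing multi-output support is to fit one model per target column and stack the predictions. The sketch below is a minimal illustration under stated assumptions: fit_per_column, predict_per_column and mean_learner are hypothetical helper names introduced here, and the stand-in learner simply predicts the training mean so that the example runs without any packages.

```julia
using Statistics: mean

# Fit one single-output model per column of Y.
# `fit_one(X, y)` must return a function mapping a feature matrix
# to a vector of predictions.
function fit_per_column(fit_one, X::AbstractMatrix, Y::AbstractMatrix)
    return [fit_one(X, Y[:, j]) for j in axes(Y, 2)]
end

# Stack the per-column predictions into an n_samples × n_outputs matrix.
function predict_per_column(models, X::AbstractMatrix)
    return reduce(hcat, [m(X) for m in models])
end

# Stand-in learner (illustration only): always predicts the training mean.
mean_learner(X, y) = Xnew -> fill(mean(y), size(Xnew, 1))

# Assumption, not tested here: with DecisionTree.jl a learner could look like
#   forest_learner(X, y) = (f = build_forest(y, X); Xnew -> apply_forest(f, Xnew))

X = rand(20, 3)
Y = hcat(fill(1.0, 20), fill(2.0, 20))   # two target columns
models = fit_per_column(mean_learner, X, Y)
preds = predict_per_column(models, rand(5, 3))
println(size(preds))
```

This treats the target columns as independent, so it does not reproduce a true multi-output forest that shares its splits across targets.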
