I'm trying to build a random forest model to predict arsenic concentration, using Sentinel-2 bands and VNIR and MNIR spectra as predictors. The model is constructed as follows:
RF_RAW <- randomForest(Arsenic ~ ., data = merged[, c(5, 10:1944)], importance = TRUE, na.action = na.omit)
The problem is with the column names: the function does not recognize the wavelength columns and stops with this error:
Error in eval(predvars, data, env) : object '400' not found
The same happens for any other wavelength column if I leave out 400, in both the VNIR and MNIR spectra. The Sentinel band columns (B1 to B12) work, and the random forest runs with just them. I can't figure out what is going wrong.
head(merged)
# A tibble: 6 x 1,944
Sample_ID Corg H2O KCL Arsenic Phospate FID Longitude Latitude B1 B2 B3 B4 B5 B6 B7 B8
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 P1_S1_1 1.16 5.6 4.32 47958. 597. 10 16.9 50.4 0.0371 0.0358 0.0527 0.0557 0.101 0.204 0.224 0.219
2 P1_S1_2 0.470 6.24 4.7 50720. 398. 9 16.9 50.4 0.0371 0.0395 0.0606 0.0641 0.101 0.204 0.224 0.259
3 P2_S1_1 1.33 6.28 4.91 31055. 316. 8 16.9 50.4 0.0323 0.0436 0.0682 0.0729 0.107 0.182 0.226 0.232
4 P2_S1_2 12.7 7.67 7.07 37492. 695. 7 16.9 50.4 0.0322 0.034 0.0576 0.0422 0.102 0.223 0.231 0.288
5 P3_S3_1 0.617 6.76 5.64 54245. 249. 6 16.9 50.4 0.0322 0.0438 0.0662 0.0696 0.113 0.213 0.253 0.233
6 P4_S4_1 20.8 4.41 3.21 7175. 731. 11 16.9 50.4 0.027 0.0253 0.0554 0.0272 0.0966 0.281 0.356 0.404
# ... with 1,927 more variables: B8A <dbl>, B9 <dbl>, B11 <dbl>, B12 <dbl>, 400 <dbl>, 401 <dbl>, 402 <dbl>, 403 <dbl>,
# 404 <dbl>, 405 <dbl>, 406 <dbl>, 407 <dbl>, 408 <dbl>, 409 <dbl>, 410 <dbl>, 411 <dbl>, 412 <dbl>, 413 <dbl>, 414 <dbl>,
# 415 <dbl>, 416 <dbl>, 417 <dbl>, 418 <dbl>, 419 <dbl>, 420 <dbl>, 421 <dbl>, 422 <dbl>, 423 <dbl>, 424 <dbl>, 425 <dbl>,
# 426 <dbl>, 427 <dbl>, 428 <dbl>, 429 <dbl>, 430 <dbl>, 431 <dbl>, 432 <dbl>, 433 <dbl>, 434 <dbl>, 435 <dbl>, 436 <dbl>,
# 437 <dbl>, 438 <dbl>, 439 <dbl>, 440 <dbl>, 441 <dbl>, 442 <dbl>, 443 <dbl>, 444 <dbl>, 445 <dbl>, 446 <dbl>, 447 <dbl>,
# 448 <dbl>, 449 <dbl>, 450 <dbl>, 451 <dbl>, 452 <dbl>, 453 <dbl>, 454 <dbl>, 455 <dbl>, 456 <dbl>, 457 <dbl>, 458 <dbl>,
# 459 <dbl>, 460 <dbl>, 461 <dbl>, 462 <dbl>, 463 <dbl>, 464 <dbl>, 465 <dbl>, 466 <dbl>, 467 <dbl>, 468 <dbl>, ...,
colnames(merged)
[1] "Sample_ID" "Corg" "H2O" "KCL" "Arsenic" "Phospate" "FID" "Longitude" "Latitude"
[10] "B1" "B2" "B3" "B4" "B5" "B6" "B7" "B8" "B8A"
[19] "B9" "B11" "B12" "400" "401" "402" "403" "404" "405"
[28] "406" "407" "408" "409" "410" "411" "412" "413" "414"
[37] "415" "416" "417" "418" "419" "420" "421" "422" "423"
[46] "424" "425" "426" "427" "428" "429" "430" "431" "432"
and so forth. The merged CSV used in this example is available at https://drive.google.com/file/d/1xifstUBv6sqa8-c51ukRsw9oZyKtzK03/view?usp=sharing. I have similar files with spectral transformations applied, and they behave the same way.
Thanks in advance
Answer:
It looks like the formula interface of randomForest does not handle non-syntactic column names, such as names that start with a digit ("400", "401", ...). Fix the column names before fitting. Try this example:
# example data
x <- mtcars[1:10, 1:3]
x[, "400"] <- x$disp
library(randomForest)
colnames(x)
# [1] "mpg" "cyl" "disp" "400"
# as expected we get error:
randomForest(mpg ~ ., data = x)
# Error in eval(predvars, data, env) : object '400' not found
# now fix the column names
colnames(x) <- make.names(colnames(x))
colnames(x)
# [1] "mpg" "cyl" "disp" "X400"
randomForest(mpg ~ ., data = x)
# Call:
# randomForest(formula = mpg ~ ., data = x)
# Type of random forest: regression
# Number of trees: 500
# No. of variables tried at each split: 1
#
# Mean of squared residuals: 5.218906
# % Var explained: 31.39
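The same fix can be sketched for a data frame shaped like merged. This is a minimal sketch on made-up data: toy, fixed, predictors, rf_formula and rf_xy are hypothetical names, and the values are random, not taken from the linked CSV. It also shows a second option, passing predictors and response through the x and y arguments so that the formula is never evaluated; that this works on the full 1,944-column tibble is an assumption.

```r
library(randomForest)

# Toy stand-in for `merged` (hypothetical data): a response,
# one band column and two wavelength columns with numeric names.
set.seed(1)
toy <- data.frame(Arsenic = rnorm(20), B1 = runif(20))
toy[["400"]] <- runif(20)
toy[["401"]] <- runif(20)

# Diagnostic: list the column names that are not syntactically valid in R.
bad_names <- colnames(toy)[colnames(toy) != make.names(colnames(toy))]
print(bad_names)

# Option 1: make the names syntactic ("400" becomes "X400"),
# then use the formula interface as in the question.
fixed <- toy
colnames(fixed) <- make.names(colnames(fixed), unique = TRUE)
rf_formula <- randomForest(Arsenic ~ ., data = fixed,
                           importance = TRUE, na.action = na.omit)

# Option 2: keep the original names and pass the predictors and the
# response directly, which bypasses the formula.
predictors <- toy[, setdiff(colnames(toy), "Arsenic")]
rf_xy <- randomForest(x = predictors, y = toy$Arsenic, importance = TRUE)
```

For the data in the question, option 1 would amount to running colnames(merged) <- make.names(colnames(merged)) before the original randomForest call. With option 2 there is no na.action argument, so rows with missing values would have to be dropped beforehand, for example with complete.cases().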