I have a dataset called NFL. I tried to run XGBoost on it, but got this error: Error in xgb.DMatrix(X_Train, label = labels) : 'data' has class 'character' and length 64617. 'data' accepts either a numeric matrix or a single filename.
I'm trying to use "outcome" as the variable to predict, and I want to make it numeric. The "outcome" variable takes the values "Win", "Tie", and "Loss"; I'd like them to appear in the dataset as 1, 2, and 3.
Here is the code
NFL <- NFL %>% mutate(id = row_number())
# Divide into two groups: trainSet and validate
trainSet <- train %>% sample_frac(0.7)
validate <- train %>% anti_join(trainSet)
#xg boost
set.seed(112321)
X_Train <- trainSet %>% select(-outcome) %>% as.matrix()
X_Test <- validate %>% select(-target) %>% as.matrix()
labels <- trainSet$outcome %>% as.matrix()
Train <- xgb.DMatrix(X_Train, label = labels)
xgbModel <- xgboost(data = trainSet, objective = "classification" ,
nrounds = 50, subsample=1, colsample_bytree = 1, max_depth = 10,
eta=0.2, verbose=FALSE)
xgbPred <- predict(xgbModel, validate)
xgbROC <- evaluate(xgbPred, validate$target)
Can anybody tell me how to fix this? Thank you very much!
Update: I tried NFL %>% mutate(outcome = ifelse(outcome, c("Win", "Tie", "Loss",1,2,3))), but it returns all NAs.
CodePudding user response:
I think the general solution is to convert to factors, and then convert to numeric.
As an example
data <- data.frame(outcome = c("Win", "Tie", "Loss"), other_cols = runif(3))
data$outcome <- as.numeric(factor(data$outcome, levels=c("Win", "Tie", "Loss")))
head(data)
#> outcome other_cols
#> 1 1 0.08823792
#> 2 2 0.98049935
#> 3 3 0.61575916
Created on 2021-09-22 by the reprex package (v2.0.1)
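To connect this back to the error in the question, here is a minimal base R sketch. It assumes a toy data frame standing in for NFL: the `points` column and its values are hypothetical, and only `outcome` comes from the question. It shows why `as.matrix()` produced a character matrix, and how to get a numeric feature matrix plus integer labels. Labels are shifted to start at 0 because xgboost's multiclass objectives expect classes numbered 0 to num_class - 1.

```r
# Toy stand-in for the NFL data; `points` is a hypothetical predictor column.
nfl <- data.frame(
  outcome = c("Win", "Tie", "Loss", "Win"),
  points  = c(27, 20, 10, 31),
  stringsAsFactors = FALSE
)

# One character column is enough to make as.matrix() return a character
# matrix, which is what xgb.DMatrix() rejects.
is.character(as.matrix(nfl))

# Encode the outcome with explicit levels so the mapping is under your control,
# then subtract 1 so the classes are 0 (Win), 1 (Tie), 2 (Loss).
outcome_levels <- c("Win", "Tie", "Loss")
labels <- as.integer(factor(nfl$outcome, levels = outcome_levels)) - 1L

# Drop the outcome before building the feature matrix so it stays numeric.
X <- as.matrix(nfl[, setdiff(names(nfl), "outcome"), drop = FALSE])
is.numeric(X)

# With the xgboost package installed, these could then be passed on, e.g.:
# dtrain <- xgboost::xgb.DMatrix(X, label = labels)
# and the model trained with objective = "multi:softprob" and num_class = 3.
```

If the real data has other character columns besides `outcome`, they would need to be encoded too before `as.matrix()` yields a numeric matrix.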
CodePudding user response:
For xgboost, I recommend using the tidymodels packages for preprocessing. You're also more likely to get interpretable/meaningful results if you convert unordered categorical variables to dummy variables (one column per category) rather than a single numeric column (unless the factor is ordered). For example:
library(tidymodels)
rec <- recipe(outcome_variable ~ ., data = train) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
processed_training_data <- prep(rec) %>% juice()
...will return an updated version of your training data with all categorical predictors converted to dummy variables that xgboost() can read; the optional step_normalize() centers and scales the numeric predictor variables.
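If you would rather not add tidymodels, a rough base R alternative is `model.matrix()`. This is a different technique from the recipe above, shown only as a sketch; the `team` and `points` columns are hypothetical. It expands a categorical predictor into indicator columns and returns a numeric matrix:

```r
# Hypothetical toy data: `team` is an unordered categorical predictor.
games <- data.frame(
  outcome = c("Win", "Loss", "Tie", "Win"),
  team    = factor(c("A", "B", "C", "A")),
  points  = c(27, 10, 20, 31)
)

# The `- 1` drops the intercept, so the first factor in the formula gets
# one indicator column per level instead of omitting a reference level.
X <- model.matrix(~ team + points - 1, data = games)

colnames(X)   # indicator columns for team, followed by points
is.numeric(X) # a numeric matrix, the input type xgb.DMatrix() asks for
```

The outcome is left out of the formula on purpose, so it can be encoded separately as the label.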