How to convert category to numeric in R?

Time:09-22

I have a dataset called NFL. I tried to run XGBoost on it, but got this error: Error in xgb.DMatrix(X_Train, label = labels) : 'data' has class 'character' and length 64617. 'data' accepts either a numeric matrix or a single filename.

I'm trying to use "outcome" as the label, so I need it to be numeric. The "outcome" variable takes the values "Win", "Tie", and "Loss", and I'd like them to appear in the dataset as 1, 2, and 3.

Here is the code

NFL <- NFL %>% mutate(id = row_number())
# Divided into two groups: trainSet and validate
trainSet <- train %>% sample_frac(0.7)
validate <- train %>% anti_join(trainSet)

#xg boost    
set.seed(112321)

X_Train <- trainSet %>% select(-outcome) %>% as.matrix()
X_Test <- validate %>% select(-target) %>% as.matrix()
labels <- trainSet$outcome %>% as.matrix()
Train <- xgb.DMatrix(X_Train, label = labels)


xgbModel <- xgboost(data = trainSet, objective = "classification" , 
nrounds = 50, subsample=1, colsample_bytree = 1, max_depth = 10, 
eta=0.2, verbose=FALSE)

xgbPred <- predict(xgbModel, validate)
xgbROC <- evaluate(xgbPred, validate$target)

Can anybody tell me how to fix this? Thank you very much!

Update: I tried to use: NFL%>% mutate(outcome = ifelse(outcome, c("Win", "Tie", "Loss",1,2,3)))

But it returns all NAs.

CodePudding user response:

I think the general solution is to convert to factors, and then convert to numeric.

As an example

data <- data.frame(outcome = c("Win", "Tie", "Loss"), other_cols = runif(3))
data$outcome <- as.numeric(factor(data$outcome, levels=c("Win", "Tie", "Loss")))
head(data)
#>   outcome other_cols
#> 1       1 0.08823792
#> 2       2 0.98049935
#> 3       3 0.61575916

Created on 2021-09-22 by the reprex package (v2.0.1)
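
For context on the original error: as.matrix() on a data frame that still contains a character column produces a character matrix, which xgb.DMatrix() rejects. Below is a minimal base-R sketch of the factor-to-numeric idea applied to both the labels and the predictor matrix. The toy data frame and its `points` column are hypothetical stand-ins for the NFL data. The sketch also shifts the codes to start at 0, since xgboost's multiclass objectives expect labels 0, 1, ..., num_class - 1. The xgboost call is guarded, so the rest runs without the package installed.

```r
# Toy data standing in for the NFL set; `points` is a hypothetical predictor.
nfl_toy <- data.frame(
  outcome = c("Win", "Tie", "Loss", "Win"),
  points  = c(27, 20, 10, 31),
  stringsAsFactors = FALSE
)

# Fix the level order explicitly so the codes are stable: Win = 1, Tie = 2, Loss = 3.
outcome_levels <- c("Win", "Tie", "Loss")
codes <- as.integer(factor(nfl_toy$outcome, levels = outcome_levels))

# Shift to 0-based labels (Win = 0, Tie = 1, Loss = 2) for multiclass xgboost.
labels <- codes - 1L

# Keep only numeric predictor columns so as.matrix() yields a numeric matrix.
X <- as.matrix(nfl_toy[, "points", drop = FALSE])

# Only attempted if xgboost is installed.
if (requireNamespace("xgboost", quietly = TRUE)) {
  dtrain <- xgboost::xgb.DMatrix(data = X, label = labels)
}
```

If the 1, 2, 3 coding is needed elsewhere (for display, say), keep `codes` and use `labels` only for the model.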

CodePudding user response:

For xgboost, I recommend using the tidymodels packages for preprocessing. You're also more likely to get interpretable/meaningful results if you convert unordered categorical variables to dummy variables (one column per category) rather than a single numeric column (unless the factor is ordered). For example:

library(tidymodels)

rec <- recipe(outcome_variable ~ ., data = train) %>% 
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

processed_training_data <- prep(rec) %>% juice()

...will return an updated version of your training data with all categorical variables converted to dummy variables that can be read by xgboost(). The optional step_normalize() centers and scales the numeric predictor variables.
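
To see what dummy encoding looks like without installing tidymodels, here is a small base-R sketch using model.matrix(). It is a stand-in illustration rather than the recipe above, and the column names `team_side` and `points` are hypothetical. With an intercept in the formula, model.matrix() drops one reference level per factor, so a factor with n levels becomes n - 1 indicator columns.

```r
# Hypothetical predictors: one categorical, one numeric.
toy <- data.frame(
  team_side = c("home", "away", "home"),
  points    = c(27, 10, 31),
  stringsAsFactors = TRUE
)

# Expand factors into 0/1 indicator columns, then drop the intercept column.
# "away" is the reference level (alphabetically first), so only
# `team_sidehome` is created for `team_side`.
X <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]

print(colnames(X))
print(is.numeric(X))
```

The result is a plain numeric matrix, which is the input type xgb.DMatrix() asks for in the error message from the question.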
