Type of predictors in new data do not match that of the training data

Time:04-19

I want to predict the suicide rate (log_suicides_per_100k) in R using a random forest. The problem is that when I try to predict for one specific level of a factor variable, I get this error:

Type of predictors in new data do not match that of the training data. 

The model is:

rf3 <- randomForest(log_suicides_per_100k ~ age + sex + log_gdp_per_capita + log_population + year, # formula
                    data = train, # data
                    ntree = 500)

sex has two levels: "female" and "male". age has six levels: "15-24 years", "25-34 years", "35-54 years", "5-14 years", "55-74 years", "75  years".

structure(list(year = c(2001L, 2004L, 2008L, 2010L, 2004L, 2011L
), sex = structure(c(2L, 2L, 1L, 2L, 2L, 1L), .Label = c("female", 
"male"), class = "factor"), age = structure(c(1L, 6L, 3L, 6L, 
2L, 3L), .Label = c("15-24 years", "25-34 years", "35-54 years", 
"5-14 years", "55-74 years", "75  years"), class = "factor"), 
log_population = c(14.0462476055718, 10.0651811415341, 
13.5550389013841, 
10.2665669441479, 15.5047227728237, 13.4021140795298), 
log_suicides_per_100k = c(2.42657107277504, 
4.03069453514564, 2.38508631450579, 4.15261347034608, 
2.88480071284671, 
0.647103242058539), log_gdp_per_capita = c(7.67786350067821, 
9.13701670755734, 11.1338150021447, 9.65117262392164, 
7.95472333449791, 
8.14177220465645)), row.names = c(7888L, 8465L, 7593L, 8535L, 
25159L, 9656L), class = "data.frame")

I want to predict the suicide rate for males in the 75 years age group for the year 2025.

prediction <- predict(rf3, data.frame (age = '75  years', sex= 'male', log_gdp_per_capita = 13.082, log_population = 9.393, year = 2025))

CodePudding user response:

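Here's some code that works on your sample data. Because you haven't included all of your code, it may need adjusting for your setup. The key thing to get right is that the factor columns in the new data must have the same levels as in the training data. Building the new data frame from plain strings gives character columns (or, before R 4.0, factors with a single level), which do not match the two-level and six-level factors the model was trained on. The code below therefore converts sex and age in the test data to factors, using the levels copied from the training data.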

library(randomForest)

traindf <- structure(
    list(
        year = c(2001L, 2004L, 2008L, 2010L, 2004L, 2011L),
        sex = structure(
            c(2L, 2L, 1L, 2L, 2L, 1L),
            .Label = c("female", "male"),
            class = "factor"
        ),
        age = structure(
            c(1L, 6L, 3L, 6L, 2L, 3L),
            .Label = c(
                "15-24 years",
                "25-34 years",
                "35-54 years",
                "5-14 years",
                "55-74 years",
                "75  years"
            ),
            class = "factor"
        ),
        log_population = c(
            14.0462476055718,
            10.0651811415341,
            13.5550389013841,
            10.2665669441479,
            15.5047227728237,
            13.4021140795298
        ),
        log_suicides_per_100k = c(
            2.42657107277504,
            4.03069453514564,
            2.38508631450579,
            4.15261347034608,
            2.88480071284671,
            0.647103242058539
        ),
        log_gdp_per_capita = c(
            7.67786350067821,
            9.13701670755734,
            11.1338150021447,
            9.65117262392164,
            7.95472333449791,
            8.14177220465645
        )
    ),
    row.names = c(7888L, 8465L, 7593L, 8535L, 25159L, 9656L),
    class = "data.frame"
)

rf3 <- randomForest(log_suicides_per_100k ~ age + sex + log_gdp_per_capita + log_population + year, data=traindf)

testdf <- data.frame(age='75  years', sex='male', log_gdp_per_capita=13.082, log_population=9.393, year=2025)
testdf$sex <- factor(testdf$sex, levels=levels(traindf$sex))
testdf$age <- factor(testdf$age, levels=levels(traindf$age))

prediction <- predict(rf3, testdf)
prediction

# 3.200609 (the exact value varies between runs because no seed is set)
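
If you have many factor predictors, the same level alignment can be done in a loop instead of one column at a time. The following is only a minimal base-R sketch: align_factor_levels() is a hypothetical helper written for this example (it is not part of randomForest), and the toy data frames are made up for illustration. It assumes the new data uses the same column names as the training data.

```r
# Hypothetical helper (not part of randomForest): for every column that is a
# factor in `train`, convert the same column in `newdata` to a factor with
# identical levels. Stops if `newdata` contains a value that was never seen.
align_factor_levels <- function(newdata, train) {
  for (col in intersect(names(newdata), names(train))) {
    if (is.factor(train[[col]])) {
      values <- as.character(newdata[[col]])
      unknown <- setdiff(unique(values[!is.na(values)]), levels(train[[col]]))
      if (length(unknown) > 0) {
        stop("Unknown level(s) in column '", col, "': ",
             paste(unknown, collapse = ", "))
      }
      newdata[[col]] <- factor(values, levels = levels(train[[col]]))
    }
  }
  newdata
}

# Toy training data with two factor predictors (illustrative values only).
train <- data.frame(
  sex  = factor(c("male", "female", "male")),
  age  = factor(c("15-24 years", "25-34 years", "35-54 years")),
  year = c(2001L, 2004L, 2008L)
)

# New data built from plain strings; stringsAsFactors = FALSE keeps the
# columns as character on every R version.
newdata <- data.frame(sex = "male", age = "35-54 years", year = 2025,
                      stringsAsFactors = FALSE)

# Before alignment the classes of the factor columns differ.
sapply(train[c("sex", "age")], class)
sapply(newdata[c("sex", "age")], class)

aligned <- align_factor_levels(newdata, train)

# After alignment the level sets are the same as in the training data.
identical(levels(aligned$sex), levels(train$sex))
identical(levels(aligned$age), levels(train$age))
```

With the data from the question, this would be called as testdf <- align_factor_levels(testdf, traindf) before predict(rf3, testdf). Stopping on unknown values is a design choice: silently turning them into NA would hide typos such as a missing space in an age label.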