RLM fit model returns "Error in `[.data.frame`(mf, xvars) : undefined columns selected"

Time:10-04

I am trying to use the rlm function to fit a robust linear model to my training data. Specifically, the data frame trainingData contains 100 predictors (IR wavelengths from 852nm to 1050nm) and 1 response (Fat). However, when I try to fit the robust linear model (rlm) I get the following error:

"Error in `[.data.frame`(mf, xvars) : undefined columns selected"

I am trying to model Fat against all the IR wavelengths; all of these columns are contained in the data frame trainingData.

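#Loading the required packages and the "Tecator" data into R
library(caret)  #provides the tecator data and createDataPartition()
library(MASS)   #provides rlm()
data(tecator)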

#Naming columns for easier interpretation
colnames(absorp) <- c(paste0(seq(852,1050,2),'nm'))
colnames(endpoints) <- c('Water','Fat','Protein')

#Creating training and test sets
index <- createDataPartition(endpoints[,'Fat'], p = .7, list = FALSE)
train.absorp    <- as.data.frame(absorp[index,])
test.absorp     <- as.data.frame(absorp[-index,])
train.endpoints <- as.data.frame(endpoints[index,])
test.endpoints  <- as.data.frame(endpoints[-index,])

#Creating training data frame for fat content prediction
trainingData     <- train.absorp
trainingData$Fat <- train.endpoints$Fat

rlmFitAllPredictors <- rlm(Fat ~., data = trainingData)

Answer:

The issue seems to be related to the column names, which start with digits. If we change them by adding a character in front, it works:

names(trainingData)[-ncol(trainingData)] <- paste0("X", names(trainingData)[-ncol(trainingData)])

-testing

rlmFitAllPredictors <- rlm(Fat ~., data = trainingData)

-output structure

> str(rlmFitAllPredictors)
List of 21
 $ coefficients : Named num [1:101] 4.56 13058.33 -12528.76 -13401.94 35613.86 ...
  ..- attr(*, "names")= chr [1:101] "(Intercept)" "X852nm" "X854nm" "X856nm" ...
 $ residuals    : Named num [1:152] 0.0484 0.013 -0.0547 0.0158 -0.0386 ...
  ..- attr(*, "names")= chr [1:152] "1" "2" "3" "4" ...
 $ wresid       : Named num [1:152] 0.0484 0.013 -0.0547 0.0158 -0.0386 ...
  ..- attr(*, "names")= chr [1:152] "1" "2" "3" "4" ...
 $ effects      : Named num [1:152] -210.4 50.5 35.7 59.2 52.3 ...
  ..- attr(*, "names")= chr [1:152] "(Intercept)" "X852nm" "X854nm" "X856nm" ...
 $ rank         : int 101
 $ fitted.values: Named num [1:152] 22.45 40.09 8.45 5.88 25.54 ...
  ..- attr(*, "names")= chr [1:152] "1" "2" "3" "4" ...
...
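The renaming step above could also be done with base R's make.names(), which converts any name into a syntactically valid one. The following is only a minimal sketch on a hypothetical toy data frame (toy, is_pred and the simulated values are assumptions for illustration, not the Tecator data):

```r
# Hypothetical toy data with column names that start with digits
set.seed(1)
toy <- data.frame(`852nm` = rnorm(20),
                  `854nm` = rnorm(20),
                  check.names = FALSE)
toy$Fat <- 1 + 2 * toy$`852nm` - toy$`854nm` + rnorm(20, sd = 0.1)

# make.names() turns each predictor name into a syntactically valid one;
# unique = TRUE guards against duplicates after the conversion
is_pred <- names(toy) != "Fat"
names(toy)[is_pred] <- make.names(names(toy)[is_pred], unique = TRUE)
print(names(toy))

# With valid names the formula interface can be used (needs the MASS package)
if (requireNamespace("MASS", quietly = TRUE)) {
  fit <- MASS::rlm(Fat ~ ., data = toy)
  print(names(coef(fit)))
}
```

Compared with pasting a fixed "X" prefix, make.names() also handles other invalid characters such as spaces, at the cost of less control over the resulting names.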

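The reason is a mismatch in column names when the names start with digits. Such names are not syntactically valid in R, so the terms object refers to them with backquotes (e.g. `852nm`), while the lookup mf[xvars] inside rlm expects names that exist as columns of the model frame. If we add some print statements to a copy of the function, this becomes clear: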

rlm_test <- function (formula, data, weights, ..., subset, na.action, method = c("M", 
    "MM", "model.frame"), wt.method = c("inv.var", "case"), model = TRUE, 
    x.ret = TRUE, y.ret = FALSE, contrasts = NULL) 
{
    mf <- match.call(expand.dots = FALSE)
    mf$method <- mf$wt.method <- mf$model <- mf$x.ret <- mf$y.ret <- mf$contrasts <- mf$... <- NULL
    mf[[1L]] <- quote(stats::model.frame)
    mf <- eval.parent(mf)
    method <- match.arg(method)
    wt.method <- match.arg(wt.method)
    if (method == "model.frame") 
        return(mf)
    mt <- attr(mf, "terms")
    print(mt)
    y <- model.response(mf)
    offset <- model.offset(mf)
    if (!is.null(offset)) 
        y <- y - offset
    x <- model.matrix(mt, mf, contrasts)
    print("x vars")
    print(head(x, 2))
    xvars <- as.character(attr(mt, "variables"))[-1L]
    if ((yvar <- attr(mt, "response")) > 0L) 
        xvars <- xvars[-yvar]
    xlev <- if (length(xvars) > 0L) {
        xlev <- lapply(mf[xvars], levels)
        xlev[!sapply(xlev, is.null)]
    }
    weights <- model.weights(mf)
    if (!length(weights)) 
        weights <- rep(1, nrow(x))
    fit <- rlm.default(x, y, weights, method = method, wt.method = wt.method, 
        ...)
    fit$terms <- mt
    cl <- match.call()
    cl[[1L]] <- as.name("rlm")
    fit$call <- cl
    fit$contrasts <- attr(x, "contrasts")
    fit$xlevels <- .getXlevels(mt, mf)
    fit$na.action <- attr(mf, "na.action")
    if (model) 
        fit$model <- mf
    if (!x.ret) 
        fit$x <- NULL
    if (y.ret) 
        fit$y <- y
    fit$offset <- offset
    if (!is.null(offset)) 
        fit$fitted.values <- fit$fitted.values + offset
    fit
}

Now test it again on the original data

 rlm_test(Fat ~., data = trainingData)

part of the output printed

...
attr(,"predvars")
list(Fat, `852nm`, `854nm`, `856nm`, `858nm`, `860nm`, `862nm`, 
    `864nm`, `866nm`, `868nm`, `870nm`, `872nm`, `874nm`, `876nm`, 
    `878nm`, `880nm`, `882nm`, `884nm`, `886nm`, `888nm`, `890nm`, 
    `892nm`, `894nm`, `896nm`, `898nm`, `900nm`, `902nm`, `904nm`, 
    `906nm`, `908nm`, `910nm`, `912nm`, `914nm`, `916nm`, `918nm`, 
    `920nm`, `922nm`, `924nm`, `926nm`, `928nm`, `930nm`, `932nm`, 
    `934nm`, `936nm`, `938nm`, `940nm`, `942nm`, `944nm`, `946nm`, 
    `948nm`, `950nm`, `952nm`, `954nm`, `956nm`, `958nm`, `960nm`, 
    `962nm`, `964nm`, `966nm`, `968nm`, `970nm`, `972nm`, `974nm`, 
    `976nm`, `978nm`, `980nm`, `982nm`, `984nm`, `986nm`, `988nm`, 
    `990nm`, `992nm`, `994nm`, `996nm`, `998nm`, `1000nm`, `1002nm`, 
    `1004nm`, `1006nm`, `1008nm`, `1010nm`, `1012nm`, `1014nm`, 
    `1016nm`, `1018nm`, `1020nm`, `1022nm`, `1024nm`, `1026nm`, 
    `1028nm`, `1030nm`, `1032nm`, `1034nm`, `1036nm`, `1038nm`, 
    `1040nm`, `1042nm`, `1044nm`, `1046nm`, `1048nm`, `1050nm`)
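The backquoted names in this output can be reproduced on a small example without calling rlm at all. The sketch below uses a hypothetical two-predictor data frame (df and its values are made up for illustration) and repeats only the lines of rlm_test() that build xvars, then compares them with the column names of the model frame:

```r
# Hypothetical data frame with non-syntactic column names
df <- data.frame(`852nm` = c(2.6, 2.8, 2.5, 2.9),
                 `854nm` = c(2.7, 2.9, 2.4, 3.0),
                 Fat     = c(22.5, 40.1, 8.4, 5.9),
                 check.names = FALSE)

mf <- stats::model.frame(Fat ~ ., data = df)
mt <- attr(mf, "terms")

# Same steps that rlm_test() uses to build xvars
xvars <- as.character(attr(mt, "variables"))[-1L]
if ((yvar <- attr(mt, "response")) > 0L)
    xvars <- xvars[-yvar]

print(names(mf))              # column names stored in the model frame
print(xvars)                  # names derived from the terms object
print(xvars %in% names(mf))   # whether mf[xvars] could find each column

# Stripping the backquotes restores the match
print(gsub("`", "", xvars) %in% names(mf))
```

If the names in xvars are not all found in names(mf), then mf[xvars] has no matching columns to select, which is consistent with the "undefined columns selected" error.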