I am trying to use the rlm function to fit a robust linear model to my training data. Specifically, the data frame trainingData contains 100 predictors (IR wavelengths from 852nm to 1050nm) and one response variable (Fat). However, when I try to fit the robust linear model (rlm), I get the following error:
"Error in [.data.frame(mf, xvars) : undefined columns selected"
I am trying to model Fat against all the IR wavelengths, which are all contained in the data frame trainingData.
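#Loading the required packages and the "Tecator" data into R
library(caret) # provides data(tecator) and createDataPartition()
library(MASS) # provides rlm()
data(tecator)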
#Naming columns for easier interpretation
colnames(absorp) <- c(paste0(seq(852,1050,2),'nm'))
colnames(endpoints) <- c('Water','Fat','Protein')
#Creating training and test sets
index <- createDataPartition(endpoints[,'Fat'], p = .7, list = FALSE)
train.absorp <- as.data.frame(absorp[index,])
test.absorp <- as.data.frame(absorp[-index,])
train.endpoints <- as.data.frame(endpoints[index,])
test.endpoints <- as.data.frame(endpoints[-index,])
#Creating training data frame for fat content prediction
trainingData <- train.absorp
trainingData$Fat <- train.endpoints$Fat
rlmFitAllPredictors <- rlm(Fat ~., data = trainingData)
CodePudding user response:
The issue seems to be related to the column names, which start with digits. If we change them by adding a character in front, the model fits:
names(trainingData)[-ncol(trainingData)] <- paste0("X", names(trainingData)[-ncol(trainingData)])
-testing
rlmFitAllPredictors <- rlm(Fat ~., data = trainingData)
-output structure
> str(rlmFitAllPredictors)
List of 21
$ coefficients : Named num [1:101] 4.56 13058.33 -12528.76 -13401.94 35613.86 ...
..- attr(*, "names")= chr [1:101] "(Intercept)" "X852nm" "X854nm" "X856nm" ...
$ residuals : Named num [1:152] 0.0484 0.013 -0.0547 0.0158 -0.0386 ...
..- attr(*, "names")= chr [1:152] "1" "2" "3" "4" ...
$ wresid : Named num [1:152] 0.0484 0.013 -0.0547 0.0158 -0.0386 ...
..- attr(*, "names")= chr [1:152] "1" "2" "3" "4" ...
$ effects : Named num [1:152] -210.4 50.5 35.7 59.2 52.3 ...
..- attr(*, "names")= chr [1:152] "(Intercept)" "X852nm" "X854nm" "X856nm" ...
$ rank : int 101
$ fitted.values: Named num [1:152] 22.45 40.09 8.45 5.88 25.54 ...
..- attr(*, "names")= chr [1:152] "1" "2" "3" "4" ...
...
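The renaming above can also be sketched with base R's make.names(), which prepends "X" to names that start with a digit. This is only an illustration on a hypothetical handful of column names (the vector cols is made up for the example, not taken from the real data frame); comparing a name with its make.names() result is a simple way to spot the non-syntactic ones.

```r
# Hypothetical subset of the column names from the question
cols <- c("852nm", "854nm", "1050nm", "Fat")

# A name is syntactic if make.names() leaves it unchanged
is_syntactic <- make.names(cols) == cols
print(cols[!is_syntactic])

# make.names() prepends "X" to names that start with a digit
fixed <- make.names(cols, unique = TRUE)
print(fixed)
```

On the real data, the same idea would be names(trainingData) <- make.names(names(trainingData)), which leaves Fat untouched because it is already a valid name.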
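The reason is a mismatch in column names when the names start with digits. In model.matrix and in the terms object, such column names are created with backquotes (e.g. `852nm`), so they no longer match the plain names (852nm) in the model frame mf that the error message refers to. If we add some print statements to a copy of the function, this becomes clear: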
rlm_test <- function (formula, data, weights, ..., subset, na.action, method = c("M",
    "MM", "model.frame"), wt.method = c("inv.var", "case"), model = TRUE,
    x.ret = TRUE, y.ret = FALSE, contrasts = NULL)
{
    mf <- match.call(expand.dots = FALSE)
    mf$method <- mf$wt.method <- mf$model <- mf$x.ret <- mf$y.ret <- mf$contrasts <- mf$... <- NULL
    mf[[1L]] <- quote(stats::model.frame)
    mf <- eval.parent(mf)
    method <- match.arg(method)
    wt.method <- match.arg(wt.method)
    if (method == "model.frame")
        return(mf)
    mt <- attr(mf, "terms")
    print(mt) # added for debugging: show the terms object
    y <- model.response(mf)
    offset <- model.offset(mf)
    if (!is.null(offset))
        y <- y - offset
    x <- model.matrix(mt, mf, contrasts)
    print("x vars") # added for debugging: show the design matrix columns
    print(head(x, 2))
    xvars <- as.character(attr(mt, "variables"))[-1L]
    if ((yvar <- attr(mt, "response")) > 0L)
        xvars <- xvars[-yvar]
    xlev <- if (length(xvars) > 0L) {
        xlev <- lapply(mf[xvars], levels)
        xlev[!sapply(xlev, is.null)]
    }
    weights <- model.weights(mf)
    if (!length(weights))
        weights <- rep(1, nrow(x))
    fit <- rlm.default(x, y, weights, method = method, wt.method = wt.method,
        ...)
    fit$terms <- mt
    cl <- match.call()
    cl[[1L]] <- as.name("rlm")
    fit$call <- cl
    fit$contrasts <- attr(x, "contrasts")
    fit$xlevels <- .getXlevels(mt, mf)
    fit$na.action <- attr(mf, "na.action")
    if (model)
        fit$model <- mf
    if (!x.ret)
        fit$x <- NULL
    if (y.ret)
        fit$y <- y
    fit$offset <- offset
    if (!is.null(offset))
        fit$fitted.values <- fit$fitted.values + offset
    fit
}
Now test it again on the original data
rlm_test(Fat ~., data = trainingData)
Part of the printed output:
...
attr(,"predvars")
list(Fat, `852nm`, `854nm`, `856nm`, `858nm`, `860nm`, `862nm`,
`864nm`, `866nm`, `868nm`, `870nm`, `872nm`, `874nm`, `876nm`,
`878nm`, `880nm`, `882nm`, `884nm`, `886nm`, `888nm`, `890nm`,
`892nm`, `894nm`, `896nm`, `898nm`, `900nm`, `902nm`, `904nm`,
`906nm`, `908nm`, `910nm`, `912nm`, `914nm`, `916nm`, `918nm`,
`920nm`, `922nm`, `924nm`, `926nm`, `928nm`, `930nm`, `932nm`,
`934nm`, `936nm`, `938nm`, `940nm`, `942nm`, `944nm`, `946nm`,
`948nm`, `950nm`, `952nm`, `954nm`, `956nm`, `958nm`, `960nm`,
`962nm`, `964nm`, `966nm`, `968nm`, `970nm`, `972nm`, `974nm`,
`976nm`, `978nm`, `980nm`, `982nm`, `984nm`, `986nm`, `988nm`,
`990nm`, `992nm`, `994nm`, `996nm`, `998nm`, `1000nm`, `1002nm`,
`1004nm`, `1006nm`, `1008nm`, `1010nm`, `1012nm`, `1014nm`,
`1016nm`, `1018nm`, `1020nm`, `1022nm`, `1024nm`, `1026nm`,
`1028nm`, `1030nm`, `1032nm`, `1034nm`, `1036nm`, `1038nm`,
`1040nm`, `1042nm`, `1044nm`, `1046nm`, `1048nm`, `1050nm`)
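The same mismatch can be reproduced on a much smaller scale with base R only. The sketch below uses a made-up three-row data frame (df and its values are hypothetical) and repeats the xvars extraction step from rlm_test above, so the two sets of names can be compared side by side.

```r
# Toy data frame with one column name that starts with a digit (made-up values)
df <- data.frame(Fat = c(1.2, 3.4, 5.6))
df[["852nm"]] <- c(2.6, 2.8, 3.0)

# Build the model frame and its terms object, as rlm_test does
mf <- stats::model.frame(Fat ~ ., data = df)
mt <- attr(mf, "terms")

# Same extraction step as in rlm_test: variable names taken from the terms
xvars <- as.character(attr(mt, "variables"))[-1L]
xvars <- xvars[-attr(mt, "response")]

print(names(mf)) # column names as stored in the model frame
print(xvars)     # variable names as deparsed from the terms object

# FALSE here means mf[xvars] would select undefined columns
print(all(xvars %in% names(mf)))
```

If the two printed name vectors differ only by the backquotes, that is the mismatch described above, and prefixing the names with a letter removes it.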