I wanted to fit a smoothing spline to some data and I noticed that the internally computed LOOCV-error seems to depend on whether the data is unordered or not. Concretely, I only get the expected result when the data is ordered.
I don't see why this is to be expected? Any help?
set.seed(0)
x <- seq(1:10)
y <- x^2 rnorm(10,0,2)
fit.ss <- smooth.spline(x=x, y=y, cv=TRUE)
cat("CV ordered: ",format(fit.ss$cv.crit))
# CV ordered: 13.46173
xu <- sample(x)
yu <- y[xu]
fit.ss.u <- smooth.spline(x=xu, y=yu, cv=TRUE)
cat("CV unorderd: ",format(fit.ss.u$cv.crit))
# CV unorderd: 65552.74
spar.opt <- fit.ss$spar
preds <- rep(NA, 10)
for (i in 1:10){
ss <- smooth.spline(x=x[-i], y=y[-i], cv=TRUE, spar=spar.opt)
preds[i] <- predict(ss,x=x[i])$y
}
cat("CV manual: ",format(mean((preds - y)**2)))
# CV manual: 13.49424
CV ordered and CV manual are (almost) the same and as expected, whereas the unordered version is completely off.
Note that this is a duplicate of https://stats.stackexchange.com/q/561802/213798, where I don't seem to get any input.
CodePudding user response:
Looks like a bug in smooth.spline
. When it calculates cv.crit
internally, it compares observations in the original order to predictions with x
ordered. (I'm not sure what the exact difference is, but presumably it's some sort of "leave one out" calculation.)
Here's the code:
cv.crit <-
if(is.na(cv)) NA
else {
r <- y - fit$ty[ox]
if(cv) {
ww <- wbar
ww[ww == 0] <- 1
r <- r / (1 - (lev[ox] * w)/ww[ox])
if(no.wgts) mean(r^2) else weighted.mean(r^2, w)
} else
(if(no.wgts) mean(r^2) else weighted.mean(r^2, w)) /
(1 - (df.offset penalty * df)/n)^2
}
On the 4th line, things look wrong. At this point with your unsorted data, I see
Browse[2]> y
[1] 47.142866 80.988466 104.809307 25.829283 63.410559 3.525909 32.920100 3.347533 18.544859 11.659599
and
Browse[2]> fit$ty[ox]
[1] 2.458502 5.274807 11.019719 17.995820 25.281214 34.165585 46.918576 63.054358 82.093996 103.915902
So it looks as though fit$ty[ox]
is based on ordered x
values, whereas y
is in the original order.
Unfortunately, the correction isn't obvious: ox
is TRUE
at this point, so it's not doing anything. What they really need to do is to sort y
in the same way as fit$ty
got sorted. But there are probably other problems elsewhere, because when I tried that, it wasn't enough to fix things.
You should report this to R Core, who maintain the stats
package.