Home > front end >  R: smooth.spline LOOCV-error depends on order of datapoints?
R: smooth.spline LOOCV-error depends on order of datapoints?

Time:01-27

I wanted to fit a smoothing spline to some data and I noticed that the internally computed LOOCV-error seems to depend on whether the data is unordered or not. Concretely, I only get the expected result when the data is ordered.

I don't see why this is to be expected? Any help?

set.seed(0)
x <- seq(1:10)
y <- x^2   rnorm(10,0,2)

fit.ss <- smooth.spline(x=x, y=y,  cv=TRUE)
cat("CV ordered: ",format(fit.ss$cv.crit))
# CV ordered:  13.46173

xu <- sample(x)
yu <- y[xu]
fit.ss.u <- smooth.spline(x=xu, y=yu,  cv=TRUE)
cat("CV unorderd: ",format(fit.ss.u$cv.crit))
# CV unorderd:  65552.74

spar.opt <- fit.ss$spar
preds <- rep(NA, 10)
for (i in 1:10){
  ss <- smooth.spline(x=x[-i], y=y[-i],  cv=TRUE, spar=spar.opt)
  preds[i] <- predict(ss,x=x[i])$y
}
cat("CV manual: ",format(mean((preds - y)**2)))
# CV manual:  13.49424

CV ordered and CV manual are (almost) the same and as expected, whereas the unordered version is completely off.

Note that this is a duplicate of https://stats.stackexchange.com/q/561802/213798, where I don't seem to get any input.

CodePudding user response:

Looks like a bug in smooth.spline. When it calculates cv.crit internally, it compares observations in the original order to predictions with x ordered. (I'm not sure what the exact difference is, but presumably it's some sort of "leave one out" calculation.)

Here's the code:

cv.crit <-
    if(is.na(cv)) NA
    else {
        r <- y - fit$ty[ox]
        if(cv) {
            ww <- wbar
            ww[ww == 0] <- 1
            r <- r / (1 - (lev[ox] * w)/ww[ox])
            if(no.wgts) mean(r^2) else weighted.mean(r^2, w)
        } else
            (if(no.wgts) mean(r^2) else weighted.mean(r^2, w)) /
                (1 - (df.offset   penalty * df)/n)^2
    }

On the 4th line, things look wrong. At this point with your unsorted data, I see

Browse[2]> y
 [1]  47.142866  80.988466 104.809307  25.829283  63.410559   3.525909  32.920100   3.347533  18.544859  11.659599

and

Browse[2]> fit$ty[ox]
 [1]   2.458502   5.274807  11.019719  17.995820  25.281214  34.165585  46.918576  63.054358  82.093996 103.915902

So it looks as though fit$ty[ox] is based on ordered x values, whereas y is in the original order.

Unfortunately, the correction isn't obvious: ox is TRUE at this point, so it's not doing anything. What they really need to do is to sort y in the same way as fit$ty got sorted. But there are probably other problems elsewhere, because when I tried that, it wasn't enough to fix things.

You should report this to R Core, who maintain the stats package.

  •  Tags:  
  • Related