R Knitr Why is the cache invalidated by copying?

Time:02-21

Question

It seems the knitr cache becomes invalidated by copying the relevant files (.rmd script and cache directory) to another computer.

  1. Why is that so and
  2. how can I work around this?

Details

I do various lengthy calculations on two computers. I thought the following procedure could work:

  1. Knit a first version of a report on machine A. (includes some lengthy calculations)
  2. Copy the files created, i.e. the script and the cache directory, to machine B.
  3. Continue editing the report on machine B (without recalculations because everything is cached).

This does not work: after copying the files to B, "knit" performs a full recalculation. This happens even before the script has been edited at all, i.e. the act of copying from A to B alone seems to be enough to invalidate the cache.

Why is a full recalculation on B performed? As I understood it the caching mechanism boils down to creating and comparing a hash. I had hoped that after copying the hash would remain unchanged.

Is there something else I should copy in addition? Or is there any other way I can make the procedure above work?

Example

Any trivial script works as an example such as the one below:

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = TRUE)
```
Bla Bla
```{r test}
tmp = sort(runif(1e7))
```

Answer

I don't know the details of why that happens, but the workaround is easy: save values to files explicitly, and read them back in. You can use

saveRDS(x, "x.rds")

to save the variable x to a file named x.rds, and then

x <- readRDS("x.rds")

to read it back in. If you want to get fancy, you can check for the existence of x.rds using file.exists("x.rds") and do the full calculation followed by saveRDS if that returns FALSE, otherwise just read the data.
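A minimal sketch of that "fancy" variant, in case it helps. The helper name `cached_rds` and the file name are made up for illustration, and the short `runif` call only stands in for the lengthy calculation:

```r
# Hypothetical helper: run `compute()` only when no saved result exists at `path`.
cached_rds <- function(path, compute) {
  if (file.exists(path)) {
    # A saved result is available: read it instead of recalculating
    return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

# Demo in a temporary directory; in a report you would more likely use a
# path next to the .Rmd file so it can be copied along with the script.
path <- file.path(tempdir(), "tmp.rds")
tmp <- cached_rds(path, function() sort(runif(1e5)))
```

The first call computes and saves the result; later calls, including calls on another machine once the .rds file has been copied over, only read the file. Note that this never notices when the calculation code changes, so delete the file to force a recalculation.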

EDITED TO ADD: If you really want to know the answer to your first question, one possible approach would be to copy the folder back from the 2nd computer to the 1st, and see if it works back there. If not, do a binary compare of the original and twice copied directories and see what has changed.

If it does work, the cause might simply be different RNGkind() settings on the two computers: it's pretty common to have the buggy sample.kind = "Rounding" saved. I'm not sure whether the caching takes this into account, though. Or perhaps the package versions or R versions differ: when I updated knitr, the cache was invalidated.
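To check those candidates, you could run something like the following on both computers and compare the printed output by eye. This is only a sketch: `session_fingerprint` is a made-up helper, and the choice of packages to report is my guess at what is relevant, not a list of what knitr actually hashes.

```r
# Hypothetical helper: collect settings that might differ between machines.
session_fingerprint <- function(pkgs = c("knitr", "digest")) {
  pkg_version <- function(p) {
    # packageVersion() fails for packages that are not installed; report NA then
    tryCatch(as.character(utils::packageVersion(p)),
             error = function(e) NA_character_)
  }
  list(
    r_version = R.version.string,
    rng_kind  = RNGkind(),  # generator, normal.kind and sample.kind
    packages  = vapply(pkgs, pkg_version, character(1))
  )
}

str(session_fingerprint())
```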

MORE additions:

If you want to see what has changed, then turn on debugging on the digest::digest function, and call knitr::knit("src.Rmd"). digest() is called for each cached chunk, and passed a large list in its object argument. It should return the same hash value if the list is the same, so you'll want to save those objects, and compare them between the two computers. For example, with your toy example above, I get this passed as object:

list(eval = TRUE, echo = TRUE, results = "markup", tidy = FALSE, 
    tidy.opts = NULL, collapse = FALSE, prompt = FALSE, comment = "##", 
    highlight = TRUE, size = "normalsize", background = "#F7F7F7", 
    strip.white = TRUE, cache = 3, cache.path = "cache/", cache.vars = NULL, 
    cache.lazy = TRUE, dependson = NULL, autodep = FALSE, fig.keep = "high", 
    fig.show = "asis", fig.align = "default", fig.path = "figure/", 
    dev = "png", dev.args = NULL, dpi = 72, fig.ext = NULL, fig.width = 7, 
    fig.height = 7, fig.env = "figure", fig.cap = NULL, fig.scap = NULL, 
    fig.lp = "fig:", fig.subcap = NULL, fig.pos = "", out.width = NULL, 
    out.height = NULL, out.extra = NULL, fig.retina = 1, external = TRUE, 
    sanitize = FALSE, interval = 1, aniopts = "controls,loop", 
    warning = TRUE, error = TRUE, message = TRUE, render = NULL, 
    ref.label = NULL, child = NULL, engine = "R", split = FALSE, 
    purl = TRUE, label = "test", code = "tmp = sort(runif(1e7))", 
    75L)
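
Once you have such a list from each computer, a small helper can show which elements differ. The sketch below assumes you saved the object on each machine yourself (for example with saveRDS() from inside the debugger); `diff_elements` is a made-up name, and the two short lists are invented stand-ins rather than real knitr output:

```r
# Hypothetical helper: labels of the list elements that differ between a and b.
diff_elements <- function(a, b) {
  # Unnamed elements (such as the trailing 75L above) get positional labels
  label <- function(x) {
    n <- names(x)
    if (is.null(n)) n <- rep("", length(x))
    ifelse(n == "", paste0("[[", seq_along(x), "]]"), n)
  }
  names(a) <- label(a)
  names(b) <- label(b)
  keys <- union(names(a), names(b))
  # Keep only the labels whose values are not exactly identical
  Filter(function(k) !identical(a[[k]], b[[k]]), keys)
}

# Invented example values; in practice read the saved objects with readRDS()
obj_a <- list(eval = TRUE, cache.path = "cache/", label = "test", 75L)
obj_b <- list(eval = TRUE, cache.path = "/home/b/cache/", label = "test", 75L)
diff_elements(obj_a, obj_b)
```

Any element this reports is a candidate for why the hash changed. An element that is NULL on one machine and absent on the other is treated as equal here, which is good enough for a first look.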