Why does adding attributes to a dataframe take longer with large dataframes?

Let’s make a simple dataframe and give it an attribute “foo”:

orig <- data.frame(x1 = 1, x2 = 2)
attr(orig, "foo") <- TRUE

“foo” is there:

attributes(orig)
#> $names
#> [1] "x1" "x2"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1
#> 
#> $foo
#> [1] TRUE

But if I reorder the columns, “foo” disappears:

new <- orig[, c(2, 1)]
attributes(new)
#> $names
#> [1] "x2" "x1"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1

I could add it back with:

attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
attributes(new)
#> $names
#> [1] "x2" "x1"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1
#> 
#> $foo
#> [1] TRUE

But this operation is time-consuming. Not in this case, because it’s a one-row dataframe, but consider this example with 10,000,000 rows:

orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]

bench::mark(
  test = {
    attributes(new) <- utils::modifyList(attributes(orig), attributes(new))
  }
)
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 test         43.2ms   46.6ms      21.6    38.1MB     14.4

Of course, this isn’t a huge amount of time in absolute terms, but it is much longer than in the first case with one row (which takes only a few microseconds). It seems weird to me that the time needed to add a single attribute to a dataframe increases with the size of the dataframe. Am I missing something? Is there a more efficient way to add a list of "simple" attributes to a large dataframe?

Edit: looking for a solution with base R only

CodePudding user response:

It looks like R copies some aspects of the data frame container when you change the attributes.

orig <- data.frame(x1 = 1, x2 = 2)
# Get memory location of data frame
orig_mem_location  <- tracemem(orig) #  "<000001DEFA1FC938>"
# Get memory location of column x1
col_location  <- tracemem(orig$x1)

We can see that the memory location of the data frame changes when you define a new attribute:

attr(orig, "foo") <- TRUE 
# tracemem[0x000001defa1fc938 -> 0x000001dee758d2d0]:
tracemem(orig) == orig_mem_location # FALSE

Interestingly, though, the reference to the column remains the same, so it does not appear to be copying all the data, only the container:

tracemem(orig$x1) == col_location # TRUE
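
As a cross-check without tracemem(), here is a minimal sketch assuming the lobstr package is installed: obj_addrs() shows the column vectors keep their addresses while only the wrapper moves.

library(lobstr)
orig <- data.frame(x1 = 1, x2 = 2)
wrapper   <- obj_addr(orig)   # address of the data.frame container
col_addrs <- obj_addrs(orig)  # addresses of the column vectors
attr(orig, "foo") <- TRUE
obj_addr(orig) == wrapper             # FALSE if the container was copied, as observed above
identical(col_addrs, obj_addrs(orig)) # TRUE: the columns were not copied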

You could avoid this by converting the data frame to a data.table and using data.table::setattr() to set attributes by reference:

library(data.table)
orig <- data.frame(x1 = 1, x2 = 2)
setDT(orig)

mem_location  <- tracemem(orig)

setattr(orig, "foo", TRUE)

tracemem(orig) == mem_location # TRUE

attr(orig, "foo") # TRUE

Additionally, with data.table you can change the column order by reference so you do not lose the attributes when you reorder the columns:

setcolorder(orig, c(2,1))
attr(orig, "foo") # TRUE

orig
#    x2 x1
# 1:  2  1

EDIT

This does not seem to happen uniformly in base R on all systems. I am running R 4.1 on Windows 10. However, if you run it on R 4.0 on Ubuntu on rdrr.io, updating the attributes does not change the memory location.

One difference is that the rdrr.io interpreter uses an external BLAS library; this is part of its sessionInfo() output:

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

This question suggests that might be the salient difference. It would also be worth running benchmarks to make sure it is a real difference and not just a quirk of tracemem() on different systems.
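
For instance, a sketch of such a benchmark using the bench package from the question (the outcome may well differ across systems):

library(data.table)
df1 <- data.frame(x = rep(1, 1e7))
dt1 <- as.data.table(df1)

bench::mark(
  base_attr = { attr(df1, "foo") <- TRUE },  # may copy the container
  setattr   = setattr(dt1, "foo", TRUE),     # modifies by reference
  check = FALSE  # the two expressions return different objects
)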

CodePudding user response:

The computation time of copying all data.frame attributes scales with the size of the data.frame mainly because of the row.names attribute.

We can check that copying the row.names attribute is responsible for most of the computation time:

orig <- data.frame(x1 = rep(1, 1e7), x2 = rep(2, 1e7))
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]

microbenchmark::microbenchmark(
  all_attrs = { attributes(new) <- attributes(orig) },
  rownames = { attr(new, "row.names") <- attr(orig, "row.names") },
  foo = { attr(new, "foo") <- attr(orig, "foo") },
  times = 10,
  unit = "ms"  
)
#> Unit: milliseconds
#>       expr       min       lq       mean     median        uq        max neval
#>  all_attrs 60.477554 61.18414 64.3562408 61.9978505 67.117645  72.827139    10
#>   rownames 59.831147 61.21029 69.6012781 64.2950890 68.880676 106.280348    10
#>        foo  0.001043  0.00206  0.0072771  0.0087225  0.011206   0.015295    10
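
A likely explanation (an assumption worth verifying on your system): data.frames store automatic row names in a compact internal form, but attr() materializes them into a full integer vector on access, so copying the attribute means allocating and filling 10,000,000 integers:

.row_names_info(orig, type = 0) # internal compact form, typically c(NA, -10000000)
length(attr(orig, "row.names")) # 10000000: expanded when accessed via attr()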

If we compare this to copying the foo attribute in the case of the small data.frame, the timing is (roughly) of the same order:

orig <- data.frame(x1 = 1, x2 = 2)
attr(orig, "foo") <- TRUE
new <- orig[, c(2, 1)]

microbenchmark::microbenchmark(
  foo = { attr(new, "foo") <- attr(orig, "foo") },
  unit = "ms"
)
#> Unit: milliseconds
#>  expr     min      lq       mean    median        uq      max neval
#>   foo 0.00115 0.00118 0.00146262 0.0012055 0.0012725 0.022368   100

To be efficient, you can copy only the custom-defined attributes (instead of all data.frame attributes). For instance:

## replace only custom attributes
replace_attrs <- function(obj, new_attrs) {
  for(nm in setdiff(names(new_attrs), names(attributes(data.frame())))) {
    attr(obj, which = nm) <- new_attrs[[nm]]
  }
  return(obj)
}

new <- replace_attrs(new, attributes(orig))
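
As a quick sanity check on the 10,000,000-row data.frame (a sketch; exact timings depend on your machine), only the custom attribute is copied, so the row.names cost is avoided:

attr(new, "foo") # TRUE

microbenchmark::microbenchmark(
  custom_only = { new <- replace_attrs(new, attributes(orig)) },
  times = 10,
  unit = "ms"
)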