Efficiency discrepancies between different data.frame modification operations

Time:02-26

I'm observing the following discrepancy in the efficiency of data.frame modification in R:

microbenchmark::microbenchmark(
  mtcars$mpg[1] <- 0, 
  mtcars[["mpg"]][1] <- 0, 
  mtcars[1,"mpg"] <- 0
)
Unit: microseconds
                    expr    min     lq     mean  median      uq    max neval
      mtcars$mpg[1] <- 0  4.400  4.801  5.16010  5.0020  5.3005 14.001   100
 mtcars[["mpg"]][1] <- 0  9.600 10.201 11.08797 10.5010 10.8015 61.001   100
   mtcars[1, "mpg"] <- 0 12.702 13.451 14.28102 13.8015 14.1020 47.701   100

Moreover, the first method, using $, seems to scale much better than [[ and [, whose run time appears to grow roughly linearly with the number of rows.

f <- function(nrow){
  df <- mtcars[sample(1:32,nrow,TRUE), ]
  microbenchmark::microbenchmark(
    df$mpg[1] <- 0, 
    df[["mpg"]][1] <- 0, 
    df[1,"mpg"] <- 0
  )
}

> f(1e5)
Unit: microseconds
                expr     min       lq      mean   median       uq     max neval
      df$mpg[1] <- 0   4.801   5.7505  41.92301   6.5505  12.2515 253.501   100
 df[["mpg"]][1] <- 0 140.401 146.4510 162.82191 154.8005 165.3005 267.400   100
   df[1, "mpg"] <- 0 144.801 151.5005 167.17197 159.9010 171.7010 277.102   100
> f(1e6)
Unit: microseconds
                expr     min       lq     mean   median        uq     max neval
      df$mpg[1] <- 0   5.601   10.551  733.995   18.251  918.6015 36519.5   100
 df[["mpg"]][1] <- 0 908.402 1013.052 2420.035 1278.751 1704.5505 52940.7   100
   df[1, "mpg"] <- 0 846.401 1018.101 2356.817 1332.900 1628.1510 57710.4   100

What is driving this behaviour? Most R texts I have read seem to treat mtcars$mpg and mtcars[["mpg"]] as interchangeable, so the fact that they perform so differently when modifying is quite confusing.
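As a sanity check that the three forms differ only in speed and not in outcome, here is a minimal sketch. The names a, b, d, same_ab and same_ad are made up for illustration; all.equal() is used rather than a stricter comparison so the check is about the data itself.

```r
# Three copies of the same data, one per assignment form.
a <- mtcars
b <- mtcars
d <- mtcars

a$mpg[1] <- 0        # replacement through `$`
b[["mpg"]][1] <- 0   # replacement through `[[`
d[1, "mpg"] <- 0     # replacement through `[`

# all.equal() returns TRUE when two objects match and a description
# of the differences otherwise, hence the isTRUE() wrapper.
same_ab <- isTRUE(all.equal(a, b))
same_ad <- isTRUE(all.equal(a, d))
print(c(same_ab, same_ad))
```

If both values come back TRUE, the benchmark above is comparing equivalent operations, and the gap is purely a matter of how each form is evaluated.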

Edit: I ran a quick benchmark based on Karolis' answer below. The [[.data.frame method seems to be the culprit: performance is comparable once the data.frame is coerced to a list.

> mtlist <- as.list(mtcars)
> microbenchmark::microbenchmark(
    mtlist$mpg[1] <- 1,
    mtlist[["mpg"]][1] <- 1
  )
Unit: nanoseconds
                    expr min  lq    mean median     uq   max neval
      mtlist$mpg[1] <- 1 801 900 1045.96    901 1051.0  5501   100
 mtlist[["mpg"]][1] <- 1 701 801 1191.16    851  951.5 19401   100 

Answer:

There is no $.data.frame method, so the default $ for lists is used instead.

`$.data.frame`
Error: object '$.data.frame' not found

By contrast, both [.data.frame and [[.data.frame are methods implemented in R code.

`[.data.frame`
function...

`[[.data.frame`
function...
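The same lookup can be done without triggering an error or printing the full source. This is a small sketch using utils::getS3method() with optional = TRUE, which returns NULL when no method is registered; the variable names are illustrative only.

```r
# Look up S3 methods for class "data.frame"; NULL means "no such method".
dollar_method   <- utils::getS3method("$",  "data.frame", optional = TRUE)
bracket2_method <- utils::getS3method("[[", "data.frame", optional = TRUE)
bracket1_method <- utils::getS3method("[",  "data.frame", optional = TRUE)

print(is.null(dollar_method))        # is `$.data.frame` missing?
print(is.function(bracket2_method))  # is `[[.data.frame` an R function?
print(is.function(bracket1_method))  # is `[.data.frame` an R function?
```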

You can also read about this in help('[.data.frame'):

> [...] There is no ‘data.frame’ method for ‘$’, so ‘x$name’ uses the
  default method which treats ‘x’ as a list [...]
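To illustrate what "treats x as a list" means in practice, the sketch below extracts the column three ways: with $ on the data.frame, with $ on the underlying list obtained via unclass(), and with .subset2(), the base R accessor that bypasses method dispatch. The variable names are hypothetical, and this shows only that the extracted values agree, not how much time each route takes.

```r
via_dollar  <- mtcars$mpg               # default `$`, no data.frame method involved
via_unclass <- unclass(mtcars)$mpg      # same extraction on the plain list
via_subset2 <- .subset2(mtcars, "mpg")  # list-style extraction without dispatch

print(identical(via_dollar, via_unclass))
print(identical(via_dollar, via_subset2))
```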