Why does a tibble perform slower than a data.frame at row-wise comparison?

I am in the process of converting an old codebase to the tidyverse, and I noticed a performance decrease in one particular step. Because I now use readr (read_delim) to read my data, I end up with a tibble instead of the base R data.frame that read.delim used to give me, which is good.

However, when I use a tibble in a row-wise comparison, the computing time increases by roughly a factor of 10 compared to a regular data.frame.

Here is my code:

library(tidyverse)

# Data
df <- tribble(
  ~x_pos, ~y_pos,
  0.0,  5.0,
  NA,   NA,
  0.1,  0.9,
  1.1,  1.5,
  1.7,  2.0,
  3.2,  1.0,
  4.0,  1.5,
  4.1,  5.0,
)

# Defining Regions of interest
roi_set_top <- list(
  roi_list = list(
    roi1 = list(
      hit_name = "left",
      x1 = 1.0,
      y1 = 1.0,
      x2 = 2.0,
      y2 = 2.0
    ),
    roi2 = list(
      hit_name = "right",
      x1 = 3.0,
      y1 = 1.0,
      x2 = 4.0,
      y2 = 2.0
    )
  )
)

# ⚡️ UNCOMMENT the next line to convert the `tibble` to a `data.frame`, then source the file again
# df <- as.data.frame(df)

start.time <- Sys.time()

for (bench in 1:1000) {
  roi_vector <- rep("NO EVAL", times = nrow(df))
  
  # loop over rows
  for (i in 1:nrow(df)) {
    
    # loop over the ROI list
    for (roi in roi_set_top$roi_list) {
      
      # check if either x or y is NA (or both); if so, mark the row as "No X/Y"
      if (is.na(df[i, "x_pos"]) || is.na(df[i, "y_pos"])) {
        roi_vector[i] <- "No X/Y"
        break
      }
      
      # check the hit area
      if (df[i, "x_pos"] >= roi$x1 && df[i, "y_pos"] >= roi$y1 &&
          df[i, "x_pos"] <= roi$x2 && df[i, "y_pos"] <= roi$y2) {
        roi_vector[i] <- roi$hit_name
        break
      }
      
      # Finally, if the current row's x and y are neither NA nor in the hit range, assign "Outside ROI"
      roi_vector[i] <- "Outside ROI"
    }
  }
}

end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)

Comparison

When you source the code as is, it takes about 10 times longer than when you uncomment the line marked with ⚡️, which converts the tibble to a data.frame.

I could get the performance back by extracting the columns of the data.frame as vectors, like so: x_pos <- df$x_pos; y_pos <- df$y_pos, and using those vectors instead of df within the loop. However, I have one fundamental question.

Questions

Why does a tibble perform slower in a row-wise comparison than a base R data.frame?

As a follow-up on best practice: it seems to be bad practice to work with a df when one only needs a vector. Should one therefore always iterate over vectors instead of over the columns within a df?

CodePudding user response：

The main reason is that subsetting a tibble with [ always returns a tibble, whereas subsetting a data.frame sometimes returns a vector. In your example, this shows up when evaluating df[i, "x_pos"]: it is a 1 x 1 tibble if df is a tibble, but a numeric scalar if df is a data.frame. This makes calculations like is.na(df[i, "x_pos"]) much slower.
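To make the difference concrete, here is a minimal sketch of what each kind of subsetting returns. The objects tb and dfr are hypothetical three-row stand-ins for the data in the question, not part of the original code.

```r
library(tibble)

# Hypothetical three-row stand-in for the data in the question
tb  <- tibble(x_pos = c(0.0, NA, 0.1), y_pos = c(5.0, NA, 0.9))
dfr <- as.data.frame(tb)

tb_cell  <- tb[3, "x_pos"]    # still a 1 x 1 tibble
dfr_cell <- dfr[3, "x_pos"]   # plain numeric value: one selected column is dropped to a vector

class(tb_cell)
class(dfr_cell)

# is.na() on the tibble cell dispatches to the data.frame method and
# returns a 1 x 1 logical matrix rather than a single logical value
is.na(tb_cell)
is.na(dfr_cell)

# Ways to get a bare value out of the tibble
tb[3, "x_pos", drop = TRUE]
tb[["x_pos"]][3]
tb$x_pos[3]
```

Every df[i, "x_pos"] inside the loop therefore builds a new small tibble, and every comparison on it goes through data.frame methods instead of plain vector arithmetic.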

You'll get a bit more speed by adding drop = TRUE to each subsetting call where you really do want a vector or scalar (I saw a 25% reduction in time taken), but a better idea is to convert to vectors outside the loop to avoid all those individual accesses within the tibble. For example, this code:

start.time <- Sys.time()

for (bench in 1:1000) {
  roi_vector <- rep("NO EVAL", times = nrow(df))
  # extract the columns as plain vectors once, before looping over rows
  x_pos <- df$x_pos
  y_pos <- df$y_pos
  # loop over rows
  for (i in 1:nrow(df)) {
    # loop over the ROI list
    for (roi in roi_set_top$roi_list) {
      # check if either x or y is NA (or both); if so, mark the row as "No X/Y"
      if (is.na(x_pos[i]) || is.na(y_pos[i])) {
        roi_vector[i] <- "No X/Y"
        break
      }
      # check the hit area
      if (x_pos[i] >= roi$x1 && y_pos[i] >= roi$y1 &&
          x_pos[i] <= roi$x2 && y_pos[i] <= roi$y2) {
        roi_vector[i] <- roi$hit_name
        break
      }
      # Finally, if the current row's x and y are neither NA nor in the hit range, assign "Outside ROI"
      roi_vector[i] <- "Outside ROI"
    }
  }
}
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)

was about 60 times faster than your original code on my system.
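On the best-practice follow-up, one further option is to drop the row loop entirely and compare whole columns at once. The sketch below is an illustration rather than something I benchmarked; classify_roi is a hypothetical helper name, and it assumes the same ROI list structure as in the question.

```r
# Data and ROI definitions copied from the question
df <- data.frame(
  x_pos = c(0.0, NA, 0.1, 1.1, 1.7, 3.2, 4.0, 4.1),
  y_pos = c(5.0, NA, 0.9, 1.5, 2.0, 1.0, 1.5, 5.0)
)
roi_list <- list(
  roi1 = list(hit_name = "left",  x1 = 1.0, y1 = 1.0, x2 = 2.0, y2 = 2.0),
  roi2 = list(hit_name = "right", x1 = 3.0, y1 = 1.0, x2 = 4.0, y2 = 2.0)
)

# Hypothetical helper: classify all rows at once using vectorized comparisons
classify_roi <- function(x_pos, y_pos, roi_list) {
  # default label; note that with an empty roi_list this differs from the
  # loop version, which would leave "NO EVAL"
  result <- rep("Outside ROI", length(x_pos))
  missing_xy <- is.na(x_pos) | is.na(y_pos)

  # walk the ROIs in reverse so that, on overlap, the first ROI in the list
  # wins, matching the `break` in the loop version
  for (roi in rev(roi_list)) {
    hit <- !missing_xy &
      x_pos >= roi$x1 & x_pos <= roi$x2 &
      y_pos >= roi$y1 & y_pos <= roi$y2
    result[hit] <- roi$hit_name
  }

  result[missing_xy] <- "No X/Y"
  result
}

roi_vector <- classify_roi(df$x_pos, df$y_pos, roi_list)
print(roi_vector)
```

The only remaining loop runs over the ROIs, not the rows, so the cost of subsetting no longer depends on whether df is a tibble or a data.frame.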