Suppose we have the following dataframe:
df <- data.frame("from_id" = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 "to_id"   = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                 "x"       = c(100, 150, 50, 200, 200, 250, 125, 225, 325),
                 "y"       = c(1000, 2000, 3000, 1000, 2000, 3000, 1000, 2000, 3000))
The IDs represent origins and destinations. For each origin (from_id), I want to calculate the sum of the products of the "x" and "y" values over all of its destinations (to_id). The "x" value varies for each origin-destination pair, while there is only one "y" value per destination ID. Every origin-destination pair is included in the dataframe (no missing data). I'm currently accomplishing this as follows:
library(dplyr)

values <- c()
for (origin in unique(df$from_id)) {
  temp <- df %>%
    filter(from_id == origin)
  value <- sum(temp$x * temp$y)
  values <- c(values, value)
}
result <- cbind(unique(df$from_id), values)
To me this seems like a very clunky way to go about it, but I'm not sure how to do this in a vectorized way. Moreover, the data I'm working with is very large, so this takes a long time to run. Any suggestions on how to accomplish this without a for loop would be greatly appreciated.
CodePudding user response:
Simply using dplyr:
library(dplyr)
df %>%
  group_by(from_id) %>%
  summarise(values = sum(x * y))
  from_id  values
    <dbl>   <dbl>
1       1  550000
2       2 1350000
3       3 1550000
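As a quick check, for from_id == 1 this is 100 * 1000 + 150 * 2000 + 50 * 3000 = 550000, which matches the first row above.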
CodePudding user response:
solution 1
We can use external packages that are optimized for speed, such as collapse. All approaches in the benchmarks below are wrapped in a function. It is assumed that the output format (a matrix in the OP's case) is not important.
library(collapse)

df |>
  fgroup_by(from_id) |>
  fsummarise(values = fsum(x, y)) # y is used as weights, i.e. the weighted sum sum(x * y) per group
#>   from_id  values
#> 1       1  550000
#> 2       2 1350000
#> 3       3 1550000
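If the matrix output from the OP's cbind() is actually needed, the data frame returned above can be converted afterwards. A minimal sketch using collapse's qM() quick matrix converter (base R's as.matrix() would work as well):
res <- df |>
  fgroup_by(from_id) |>
  fsummarise(values = fsum(x, y))
qM(res)        # collapse's fast data.frame -> matrix conversion
as.matrix(res) # base R equivalent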
solution 2
Note that the for-loop can be rewritten to be more efficient simply by doing less work per iteration. Here is an example of doing so using base R.
loop_new <- function(x) {
  ids <- unique(x[["from_id"]])
  values <- vector(mode = "numeric", length(ids)) # pre-allocate instead of growing
  for (i in seq_along(ids)) {
    values[i] <- subset(x, from_id == ids[i], select = c(x, y)) |>
      transform(z = x * y) |>
      with(sum(z))
  }
  cbind(ids, values)
}
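Calling the rewritten loop on the example data returns the same per-origin totals as the other approaches, here as a matrix (illustrative call and output):
loop_new(df)
#>      ids  values
#> [1,]   1  550000
#> [2,]   2 1350000
#> [3,]   3 1550000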
benchmarks
collapse <- \(x) {x |> fgroup_by(from_id) |> fsummarise(values = fsum(x, y))}
dplyr    <- \(x) {x %>% group_by(from_id) %>% summarise(values = sum(x * y))}
loop_old <- function(x) {values <- c(); for (origin in unique(x$from_id)) {temp <- x %>% filter(from_id == origin); value <- sum(temp$x * temp$y); values <- c(values, value)}; cbind(unique(x$from_id), values)}

bench::mark(collapse(df), loop_new(df), dplyr(df), loop_old(df),
            iterations = 10e3, time_unit = "ms",
            check = FALSE)[c(1, 3, 5, 7)] # check = FALSE: the approaches return different output formats
  expression      median mem_alloc n_itr
  <bch:expr>    <bch:tm> <bch:byt> <int>
1 collapse(df)   0.059ms        0B  9995
2 loop_new(df)   0.671ms        0B  9958
3 dplyr(df)       2.34ms    2.99KB  9861
4 loop_old(df)     2.4ms    5.67KB  9841
Compared to the dplyr approach from @Parks and the for-loop approach from @tmako (all approaches wrapped in functions), collapse is much more efficient. Note that for-loops are not a bad approach at all, but they can be slow if not handled carefully.
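For reference, the same grouped sum of products can also be written as a single vectorized base R call; a minimal sketch, not part of the answers or benchmarks above:
# rowsum() sums its first argument within groups, so summing the
# precomputed products x * y by from_id gives the same totals
rowsum(df$x * df$y, group = df$from_id) # one-column matrix: 550000, 1350000, 1550000 for origins 1, 2, 3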