Suppose we have the following dataframe:
df <- data.frame("from_id" = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 "to_id"   = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                 "x"       = c(100, 150, 50, 200, 200, 250, 125, 225, 325),
                 "y"       = c(1000, 2000, 3000, 1000, 2000, 3000, 1000, 2000, 3000))
The IDs represent origins and destinations. For each origin (from_id), I want to calculate the sum of the products of the "x" and "y" values over all of its destinations (to_id). The "x" value varies for each origin-destination pair, while there is only one "y" value per destination ID. Every origin-destination pair is included in the dataframe (no missing data). I'm currently accomplishing this as follows:
library(dplyr)

values <- c()
for (origin in unique(df$from_id)) {
  temp <- df %>%
    filter(from_id == origin)
  value <- sum(temp$x * temp$y)
  values <- c(values, value)
}
result <- cbind(unique(df$from_id), values)
To me this seems like a very clunky way to go about it, but I'm not sure how to do this in a vectorized way. Moreover, the data I'm working with is very large, so this takes a long time to run. Any suggestions on how to accomplish this without a for loop would be greatly appreciated.
CodePudding user response:
Simply using dplyr:
library(dplyr)
df %>%
  group_by(from_id) %>%
  summarise(values = sum(x * y))
  from_id  values
    <dbl>   <dbl>
1       1  550000
2       2 1350000
3       3 1550000
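As a quick check, for from_id == 1 this is 100 * 1000 + 150 * 2000 + 50 * 3000 = 550000, which matches the first row above.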
CodePudding user response:
solution 1
We can use external packages that are optimized for speed, such as collapse. All approaches in the benchmarks below are wrapped in a function. It is assumed that the output format (a matrix in the OP's case) is not important.
library(collapse)

df |>
  fgroup_by(from_id) |>
  fsummarise(values = fsum(x, y)) # y is used as weights, i.e. the weighted sum sum(x * y) per group
#>   from_id  values
#> 1       1  550000
#> 2       2 1350000
#> 3       3 1550000
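If the matrix output from the OP's cbind() is actually needed, the data frame returned above can be converted afterwards. A minimal sketch using collapse's qM() quick matrix converter (base R's as.matrix() would work as well):
res <- df |>
  fgroup_by(from_id) |>
  fsummarise(values = fsum(x, y))
qM(res)        # collapse's fast data.frame -> matrix conversion
as.matrix(res) # base R equivalent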
solution 2
Note that the for-loop can be rewritten to be more efficient simply by doing less work per iteration. Here is an example of doing so using base R.
loop_new <- function(x) {
  ids <- unique(x[["from_id"]])
  values <- vector(mode = "numeric", length(ids)) # pre-allocate instead of growing
  for (i in seq_along(ids)) {
    values[i] <- subset(x, from_id == ids[i], select = c(x, y)) |>
      transform(z = x * y) |>
      with(sum(z))
  }
  cbind(ids, values)
}
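Calling the rewritten loop on the example data returns the same per-origin totals as the other approaches, here as a matrix (illustrative call and output):
loop_new(df)
#>      ids  values
#> [1,]   1  550000
#> [2,]   2 1350000
#> [3,]   3 1550000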
benchmarks
collapse <- \(x) {x |> fgroup_by(from_id) |> fsummarise(values = fsum(x, y))}
dplyr    <- \(x) {x %>% group_by(from_id) %>% summarise(values = sum(x * y))}
loop_old <- function(x) {values <- c(); for (origin in unique(x$from_id)) {temp <- x %>% filter(from_id == origin); value <- sum(temp$x * temp$y); values <- c(values, value)}; cbind(unique(x$from_id), values)}

bench::mark(collapse(df), loop_new(df), dplyr(df), loop_old(df),
            iterations = 10e3, time_unit = "ms",
            check = FALSE)[c(1, 3, 5, 7)] # check = FALSE: the approaches return different output formats
  expression      median mem_alloc n_itr
  <bch:expr>    <bch:tm> <bch:byt> <int>
1 collapse(df)   0.059ms        0B  9995
2 loop_new(df)   0.671ms        0B  9958
3 dplyr(df)       2.34ms    2.99KB  9861
4 loop_old(df)     2.4ms    5.67KB  9841
Compared to the dplyr approach from @Parks and the for-loop approach from @tmako (all approaches wrapped in functions), collapse is much more efficient. Note that for-loops are not a bad approach at all, but they can be slow if not handled carefully.
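For reference, the same grouped sum of products can also be written as a single vectorized base R call; a minimal sketch, not part of the answers or benchmarks above:
# rowsum() sums its first argument within groups, so summing the
# precomputed products x * y by from_id gives the same totals
rowsum(df$x * df$y, group = df$from_id) # one-column matrix: 550000, 1350000, 1550000 for origins 1, 2, 3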