More efficient way of using group_by > mutate > slice


I have a dataframe that looks like this

df <- data.frame("Month" = c("April", "April", "May", "May", "June", "June", "June"),
                 "ID" = c(11, 11, 12, 10, 11, 11, 11),
                 "Region" = c("East", "West", "North", "East", "North", "East", "West"),
                 "Qty" = c(120, 110, 110, 110, 100, 90, 70),
                 "Sales" = c(1000, 1100, 900, 1000, 1000, 800, 650),
                 "Leads" = c(10, 12, 9, 8, 6, 5, 4))

Month   ID     Region    Qty    Sales   Leads
April   11     East      120    1000    10
April   11     West      110    1100    12
May     12     North     110    900     9
May     10     East      110    1000    8
June    11     North     100    1000    6
June    11     East      90     800     5
June    11     West      70     650     4

I want a dataframe that looks like this

Month   ID     Qty     Sales   Leads   Region
April   11     230     2100    22      East
May     12     110     900     9       North
May     10     110     1000    8       East
June    11     260     2450    15      North

I am using the following code:

result <- df %>%
  group_by(Month, ID) %>%
  mutate(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE))) %>%
  slice(1)

result$Region <- NULL

I have over 2 million such rows and it is taking forever to calculate the aggregate.

I am using mutate and slice instead of summarize because the df is arranged in a certain way and I want to retain the Region in that first row.

However, I think there could be a more efficient way. Please help on both counts; I can't figure it out for the life of me.

CodePudding user response:

summarize makes more sense to me than mutate and slice. This should save you some time.

library(dplyr)
result <- df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
            Region = first(Region))
result
# # A tibble: 4 x 6
# # Groups:   Month [3]
#   Month    ID   Qty Sales Leads Region
#   <chr> <dbl> <dbl> <dbl> <dbl> <chr> 
# 1 April    11   230  2100    22 East  
# 2 June     11   260  2450    15 North 
# 3 May      10   110  1000     8 East  
# 4 May      12   110   900     9 North 
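
If the leftover grouping by Month is unwanted, dplyr (>= 1.0.0) can also drop it directly inside summarize via the .groups argument, so no separate ungroup() call is needed:

library(dplyr)
result <- df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region),
            .groups = "drop")   # return an ungrouped tibble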

CodePudding user response:

We can apply generic speed-up strategies:

  1. Do less
  2. Choose an appropriate back-end
  3. Use appropriate data structures

dplyr provides syntactic sugar for data manipulation, but may not be the most efficient when it comes to handling large data sets.

solution 1

We could rewrite the code slightly to be more efficient by using the collapse package, which provides fast C/C++ based equivalents of dplyr-style functions. It prefixes these verbs with f, with one exception: fsubset, which is similar to dplyr::filter (or base R subset).

library(collapse)
df |>
    fgroup_by(Month, ID) |>
    fsummarise(Qty = fsum(Qty),
               Sales = fsum(Sales),
               Leads = fsum(Leads),
               Region = fsubset(Region, 1L),
               keep.group_vars = T) |>
    as_tibble() # optional
#> # A tibble: 4 x 6
#>   Month    ID   Qty Sales Leads Region
#>   <chr> <dbl> <dbl> <dbl> <dbl> <chr> 
#> 1 April    11   230  2100    22 East  
#> 2 June     11   260  2450    15 North 
#> 3 May      10   110  1000     8 East  
#> 4 May      12   110   900     9 North 

Here |> (requires R >= 4.1.0) is the native pipe, which has slightly less overhead than %>%. The result of this pipeline is ungrouped.

solution 2

data.table is often lauded for its speed, memory efficiency and utility. The easiest way to move existing dplyr code onto data.table is the dtplyr package, which is installed as part of the tidyverse. We can convert the code by adding two lines.

library(dtplyr)
df1 <- lazy_dt(df)
df1 %>%
      group_by(Month, ID) %>%
      summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
                Region = first(Region)) %>%
      as_tibble() # or as.data.table()

Note that the result is an ungrouped data frame at the end.
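
dtplyr builds up the data.table call lazily; if you want to see the translation it generates before collecting, show_query() prints it (purely a sanity check, the pipeline does not need it):

df1 %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region)) %>%
  show_query()   # prints the generated data.table expression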

Benchmarks

The approaches are wrapped in functions for brevity; a sketch of such wrappers is shown below. dplyr here is the approach from the first answer.
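
The original wrappers are not shown; they presumably look roughly like this (the function names are only chosen to match the benchmark calls, and the bodies simply mirror the code above):

# Hypothetical wrappers around the three approaches benchmarked below
collapse <- function(df) {
  df |>
    fgroup_by(Month, ID) |>
    fsummarise(Qty = fsum(Qty),
               Sales = fsum(Sales),
               Leads = fsum(Leads),
               Region = fsubset(Region, 1L),
               keep.group_vars = TRUE)
}

dplyr <- function(df) {
  df %>%
    group_by(Month, ID) %>%
    summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
              Region = first(Region))
}

dtplyr <- function(df) {
  lazy_dt(df) %>%
    group_by(Month, ID) %>%
    summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
              Region = first(Region)) %>%
    as_tibble()
}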

bench::mark(collapse = collapse(df), dplyr = dplyr(df), dtplyr = dtplyr(df),
            time_unit = "ms", iterations = 200)[c(1, 3, 5, 7)]
# A tibble: 3 x 4
  expression median mem_alloc n_itr
  <bch:expr>  <dbl> <bch:byt> <int>
1 collapse    0.316        0B   200
2 dplyr       5.42     8.73KB   195
3 dtplyr      6.67   120.21KB   196

We can see that collapse is more memory efficient and significantly faster than dplyr. The dtplyr approach is included here because its time complexity differs from that of dplyr, even though it is not faster on this small example.

Further optimizations are still possible - for example by converting numeric columns into integer columns.
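
For example, the sample columns Qty, Sales and Leads all hold whole numbers; if that also holds for the real data, they can be stored as integers before aggregating (a small sketch; whether it pays off depends on the actual data):

# Convert whole-number columns from double to integer
df$Qty   <- as.integer(df$Qty)
df$Sales <- as.integer(df$Sales)
df$Leads <- as.integer(df$Leads)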

side-note

The Region column loses information in the OP's required approach; e.g. reporting only East for April and only North for June under ID 11 is misleading, since those month/ID groups also contain other regions.
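
If keeping all regions matters, one option (not what the OP asked for) is to collapse them into a single string instead of keeping only the first value, e.g. with the dplyr version:

df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = paste(unique(Region), collapse = ", "))  # e.g. "East, West"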
