Conditional sum with mutate (dplyr)

Time:07-29

I want to achieve something similar to Excel's SUMIF. This is easy when the condition is an equality, but not for greater-than or less-than conditions, because then the group_by() approach no longer applies. For various reasons I can't use arrange() followed by cumsum(). The data in the example is ordered, but that is not always the case.

Here is a small example; in reality I have more conditions:

data <- data.frame(a = c(2022, 2021, 2020), b = c(10, 20, 10))

get_sum  <- function(x, year){
  out <- filter(x, b <= year) %>%
    summarise(cum = sum(b)) %>% 
    pull()
  return(out)
}


data %>% 
  mutate(cum = get_sum(x = ., year = b))

Steps: for the first row, mutate filters the whole data.frame to rows where a is less than or equal to 2022, then sums column b of the filtered data.frame: 10 + 20 + 10 = 40. For the second row it does the same with a less than or equal to 2021, summing 20 + 10 = 30, and so on.

The output should look something like this:

     a  b cum
1 2022 10  40
2 2021 20  30
3 2020 10  10

I would like to know why my code doesn't work.

CodePudding user response:

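If you want to stay within the tidyverse, you can use map() from purrr (or map_dbl() if you want a plain numeric column instead of a list column). mutate() evaluates its expression once on whole columns, not once per row, so a map function is needed to pass each individual value of a into the calculation. For your small example: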

library(dplyr)
library(purrr)

data |>
  mutate(cum = map_dbl(a, ~ sum(b[a <= .x])))  # map_dbl() returns a numeric vector, not a list
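
To illustrate why the per-value call matters, here is a minimal sketch that runs the question's function as posted and compares it with the map-based version. The name get_sum_original is only a label introduced here for the unchanged function from the question.

```r
library(dplyr)
library(purrr)

data <- data.frame(a = c(2022, 2021, 2020), b = c(10, 20, 10))

# The function from the question, unchanged (note that it filters on b, not a).
get_sum_original <- function(x, year) {
  filter(x, b <= year) %>%
    summarise(cum = sum(b)) %>%
    pull()
}

# mutate() calls the function once, with the whole column b as `year`.
# `b <= year` then compares b with itself, keeps every row, and the single
# total is recycled to all rows.
whole_column <- data %>%
  mutate(cum = get_sum_original(x = ., year = b))
whole_column$cum
#> [1] 40 40 40

# Calling the calculation once per value of a gives one sum per row.
per_value <- data %>%
  mutate(cum = map_dbl(a, ~ sum(b[a <= .x])))
per_value$cum
#> [1] 40 30 10
```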

Update: to make your own function work, apply the same logic with map() and cur_data(), and fix a small typo in get_sum (a <= year rather than b <= year):

get_sum <- function(df, year) {

  df |>
    filter(a <= year) |>
    summarise(cum = sum(b)) |>
    pull()

}

data |>
  mutate(cum = map_dbl(a, ~ get_sum(df = cur_data(), year = .x)))

Output:

     a  b cum
1 2022 10  40
2 2021 20  30
3 2020 10  10
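
On newer dplyr versions, cur_data() is deprecated in favour of pick(). A possible equivalent is sketched below, assuming dplyr 1.1.0 or later and R 4.1 or later for the native pipe. The rows are deliberately unordered here, which is an illustration and not data from the question.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where pick() is available
library(purrr)

# Same values as in the question, but with the rows out of order.
data <- data.frame(a = c(2021, 2022, 2020), b = c(20, 10, 10))

get_sum <- function(df, year) {
  df |>
    filter(a <= year) |>
    summarise(cum = sum(b)) |>
    pull()
}

# pick(everything()) hands the current data frame to get_sum(),
# and map_dbl() calls get_sum() once per value of a.
result <- data |>
  mutate(cum = map_dbl(a, ~ get_sum(df = pick(everything()), year = .x)))
result$cum
#> [1] 30 40 10
```

Because each row is computed from a filter over the whole data frame, the result does not depend on row order.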

CodePudding user response:

We can also use base R for this: sapply() goes through the year in each row and sums column b over all rows with a smaller or equal year.

data$cum <- sapply(data$a, function(x) sum(subset(data, a <= x)$b))  # no unique(): one result per row, even with repeated years
data
     a  b cum
1 2022 10  40
2 2021 20  30
3 2020 10  10
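
A vectorised variant of the same idea, as a sketch and not part of the answer above: outer() builds the full comparison matrix, and a matrix product does the conditional sums. data2 is hypothetical data with unordered rows and a repeated year. The matrix has n x n entries, so this only suits moderately sized data.

```r
# Hypothetical data: unordered, with a repeated year.
data2 <- data.frame(a = c(2021, 2022, 2020, 2021), b = c(20, 10, 10, 5))

# le[i, j] is TRUE when a[j] <= a[i].
le <- outer(data2$a, data2$a, ">=")

# Row i of le selects the b values that are added up for row i of data2.
data2$cum <- as.vector(le %*% data2$b)
data2$cum
#> [1] 35 45 10 35
```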

CodePudding user response:

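If the data is sorted by a in descending order, as in the example, you could use a reverse cumulative sum like this:

data <- data.frame(a = c(2022, 2021, 2020), b = c(10, 20, 10))

library(dplyr)
data %>%
  # reverse b, accumulate from the smallest year up, then restore the row order
  mutate(cum = rev(cumsum(rev(b))))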
#>      a  b cum
#> 1 2022 10  40
#> 2 2021 20  30
#> 3 2020 10  10

Created on 2022-07-28 by the reprex package (v2.0.1)
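
Whether a reverse cumulative sum matches the row-by-row definition depends on where the reversals are applied. The example's b happens to read the same forwards and backwards, which hides the difference. The sketch below uses hypothetical, non-symmetric values for b, with the data still sorted by a in descending order.

```r
# Hypothetical values: b is not symmetric, a is sorted in descending order.
data <- data.frame(a = c(2022, 2021, 2020), b = c(10, 20, 30))

# Row-by-row definition: sum of b over all rows with a smaller or equal year.
direct <- sapply(data$a, function(y) sum(data$b[data$a <= y]))
direct
#> [1] 60 50 30

# Reversing before and after the cumulative sum reproduces the definition.
reversed <- rev(cumsum(rev(data$b)))
reversed
#> [1] 60 50 30

# Reversing only the result of cumsum() does not.
naive <- rev(cumsum(data$b))
naive
#> [1] 60 30 10
```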
