Apply cumsum but with conditions-CodePudding

I'm looking for a way to cumsum a column (by group) but with specific conditions. My conditions are :

Add a starting value
The sum is restricted by upper & lower limits

I've done it with an iterative solution but wonder if a more elegant solution could be found :

# Init my data.frame with random values
# Knowing my cumsum on "value" must be applied on the "group" column using the "ordering" column
random_values <- c(-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
values_probs <- c(0.7, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03)
x <- data.frame(
  group = rep(c("A", "B"), each = 10),
  ordering = rep(1:10, 2),
  value = sample(x = random_values, replace = T, prob = values_probs, size = 20)
)

# Create the "previous_conditional_sum" column
# Init the first row by group with the default starting value
starting_value <- 3
x <- x %>%
  group_by(group) %>%
  mutate(conditional_sum = NA_real_) %>%
  mutate(previous_conditional_sum = case_when(
    ordering == min(ordering) ~ starting_value,
    T ~ NA_real_
  )) %>%
  as.data.frame()

# Here my "custom" sum function applying lower & upper limits
custom_sum <- function(a,b, lower_limit = 3, upper_limit = 15) {
  x <- a b
  x <- min(x, upper_limit)
  x <- max(x, lower_limit)
  return(x)
}

# Iteratively sum using my custom function
for (i in 1:nrow(x)) {
  
  # Init the new value
  new_value <- custom_sum(x[i,"previous_conditional_sum",drop=T], x[i,"value",drop=T])
  x[i,"conditional_sum"] <- new_value
  
  # Set the previous_conditional_sum on next line if revelant
  if (is.na(x[i 1, "previous_conditional_sum",drop=T]) & i 1<=nrow(x)) {
    x[i 1, "previous_conditional_sum"] <- new_value
  }

}

CodePudding user response：

You can use purrr::accumulate(), and pass starting_value to the .init argument:

library(dplyr)
library(purrr)
set.seed(1)

random_values <- c(-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
values_probs <- c(0.7, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03)
x <- data.frame(
  group = rep(c("A", "B"), each = 10),
  ordering = rep(1:10, 2),
  value = sample(x = random_values, replace = T, prob = values_probs, size = 20)
)
starting_value <- 3

custom_sum <- function(a,b, lower_limit = 3, upper_limit = 15) {
  x <- a b
  x <- min(x, upper_limit)
  max(x, lower_limit)
}

x_out <- x %>% 
  group_by(group) %>% 
  mutate(
    # `.init` value will be appended to start of return vector;
    # remove using `[-1]`
    condsum = accumulate(value, custom_sum, .init = starting_value)[-1]
  ) %>% 
  as.data.frame()

x_out
#>    group ordering value condsum
#> 1      A        1    -1       3
#> 2      A        2    -1       3
#> 3      A        3    -1       3
#> 4      A        4     7      10
#> 5      A        5    -1       9
#> 6      A        6     7      15
#> 7      A        7     9      15
#> 8      A        8    -1      14
#> 9      A        9    -1      13
#> 10     A       10    -1      12
#> 11     B        1    -1       3
#> 12     B        2    -1       3
#> 13     B        3    -1       3
#> 14     B        4    -1       3
#> 15     B        5     3       6
#> 16     B        6    -1       5
#> 17     B        7     1       6
#> 18     B        8     0       6
#> 19     B        9    -1       5
#> 20     B       10     3       8

^{Created on 2022-03-05 by the reprex package (v2.0.1)}

If you use custom_sum() elsewhere, it makes sense to make it an external function as you have. If it's only used here though, you could instead use an anonymous function within accumulate():

x_out <- x %>% 
  group_by(group) %>% 
  mutate(conditional_sum = accumulate(
    value, 
    ~ max(min(.x   .y, 15), 3), 
    .init = 3
  )[-1]) %>% 
  as.data.frame()

CodePudding user response：

I wrote a helper function that I think captures what you want to do, in a 'vectorized' way (x is a vector, rather than an element of a vector)

f <- function(x, start, min, max) { 
    x = start   cumsum(x)
    pmin(pmax(x, min), max)
}

It is easily tested

set.seed(123)
x <- runif(10)
x
f(x, 0, 2, 4)

and using dplyr easy to apply to grouped data (using the value of x provided in the answer by @zephryl)

x |> group_by(group) |> mutate(condsum = f(value, 3, 3, 15))

The result differs from @zephryl, because in their answer the 'clamp' (restricting the range of the returned value) is applied in each iteration, and the clamped value used for the cumulative sum; in the implementation here the cumulative sum is calculated first, and then the values clamped to the specific range.

It is not clear which result matches what you want...

I was surprised that the answers were different. I thought accumulate() would just be an iterative way of calculating a vectorized cumsum(), but it is not -- this is a subtle difference, and I wonder if it is widely appreciated?