I'm looking for a way to cumsum
a column (by group) but with specific conditions.
My conditions are :
- Add a starting value
- The sum is restricted by upper & lower limits
I've done it with an iterative solution but wonder if a more elegant solution could be found :
# Init my data.frame with random values
# Knowing my cumsum on "value" must be applied on the "group" column using the "ordering" column
random_values <- c(-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
values_probs <- c(0.7, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03)
x <- data.frame(
group = rep(c("A", "B"), each = 10),
ordering = rep(1:10, 2),
value = sample(x = random_values, replace = T, prob = values_probs, size = 20)
)
# Create the "previous_conditional_sum" column
# Init the first row by group with the default starting value
starting_value <- 3
x <- x %>%
group_by(group) %>%
mutate(conditional_sum = NA_real_) %>%
mutate(previous_conditional_sum = case_when(
ordering == min(ordering) ~ starting_value,
T ~ NA_real_
)) %>%
as.data.frame()
# Here my "custom" sum function applying lower & upper limits
custom_sum <- function(a,b, lower_limit = 3, upper_limit = 15) {
x <- a b
x <- min(x, upper_limit)
x <- max(x, lower_limit)
return(x)
}
# Iteratively sum using my custom function
for (i in 1:nrow(x)) {
# Init the new value
new_value <- custom_sum(x[i,"previous_conditional_sum",drop=T], x[i,"value",drop=T])
x[i,"conditional_sum"] <- new_value
# Set the previous_conditional_sum on next line if revelant
if (is.na(x[i 1, "previous_conditional_sum",drop=T]) & i 1<=nrow(x)) {
x[i 1, "previous_conditional_sum"] <- new_value
}
}
CodePudding user response:
You can use purrr::accumulate()
, and pass starting_value
to the .init
argument:
library(dplyr)
library(purrr)
set.seed(1)
random_values <- c(-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
values_probs <- c(0.7, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03)
x <- data.frame(
group = rep(c("A", "B"), each = 10),
ordering = rep(1:10, 2),
value = sample(x = random_values, replace = T, prob = values_probs, size = 20)
)
starting_value <- 3
custom_sum <- function(a,b, lower_limit = 3, upper_limit = 15) {
x <- a b
x <- min(x, upper_limit)
max(x, lower_limit)
}
x_out <- x %>%
group_by(group) %>%
mutate(
# `.init` value will be appended to start of return vector;
# remove using `[-1]`
condsum = accumulate(value, custom_sum, .init = starting_value)[-1]
) %>%
as.data.frame()
x_out
#> group ordering value condsum
#> 1 A 1 -1 3
#> 2 A 2 -1 3
#> 3 A 3 -1 3
#> 4 A 4 7 10
#> 5 A 5 -1 9
#> 6 A 6 7 15
#> 7 A 7 9 15
#> 8 A 8 -1 14
#> 9 A 9 -1 13
#> 10 A 10 -1 12
#> 11 B 1 -1 3
#> 12 B 2 -1 3
#> 13 B 3 -1 3
#> 14 B 4 -1 3
#> 15 B 5 3 6
#> 16 B 6 -1 5
#> 17 B 7 1 6
#> 18 B 8 0 6
#> 19 B 9 -1 5
#> 20 B 10 3 8
Created on 2022-03-05 by the reprex package (v2.0.1)
If you use custom_sum()
elsewhere, it makes sense to make it an external function as you have. If it's only used here though, you could instead use an anonymous function within accumulate()
:
x_out <- x %>%
group_by(group) %>%
mutate(conditional_sum = accumulate(
value,
~ max(min(.x .y, 15), 3),
.init = 3
)[-1]) %>%
as.data.frame()
CodePudding user response:
I wrote a helper function that I think captures what you want to do, in a 'vectorized' way (x
is a vector, rather than an element of a vector)
f <- function(x, start, min, max) {
x = start cumsum(x)
pmin(pmax(x, min), max)
}
It is easily tested
set.seed(123)
x <- runif(10)
x
f(x, 0, 2, 4)
and using dplyr easy to apply to grouped data (using the value of x
provided in the answer by @zephryl)
x |> group_by(group) |> mutate(condsum = f(value, 3, 3, 15))
The result differs from @zephryl, because in their answer the 'clamp' (restricting the range of the returned value) is applied in each iteration, and the clamped value used for the cumulative sum; in the implementation here the cumulative sum is calculated first, and then the values clamped to the specific range.
It is not clear which result matches what you want...
I was surprised that the answers were different. I thought accumulate()
would just be an iterative way of calculating a vectorized cumsum()
, but it is not -- this is a subtle difference, and I wonder if it is widely appreciated?