The actual question

Given the following data:

library(dplyr)

df <- tibble(v1 = 1:6, cond = c(1, 0, 1, 1, 0, 1))

## # A tibble: 6 × 2
##      v1  cond
##   <int> <dbl>
## 1     1     1
## 2     2     0
## 3     3     1
## 4     4     1
## 5     5     0
## 6     6     1

I want to calculate a mixture of a cumulative sum and a cumulative product. If cond = 1, calculate the sum of the current v1 and the result of the preceding calculations. If cond = 0, calculate the product of the current v1 and the result of the preceding calculations.

The desired result should look like this:

## # A tibble: 6 × 3
##      v1  cond cum_prodsum
##   <int> <dbl>       <int>
## 1     1     1           1
## 2     2     0           2
## 3     3     1           5
## 4     4     1           9
## 5     5     0          45
## 6     6     1          51

In SPSS this is the code I would use:

COMPUTE cum_prodsum = 0.
IF($casenum = 1 & cond = 1) cum_prodsum = v1.
IF($casenum > 1 & cond = 0) cum_prodsum = lag(cum_prodsum) * v1.
IF($casenum > 1 & cond = 1) cum_prodsum = lag(cum_prodsum) + v1.

But how can this be done in R?

Sounds like a silly task that no one would ever need to do? Yeah, it probably is. But think of it as a simple example of a whole group of problems where the calculation for the current row depends on the results calculated for the preceding rows.

Some information for (former) SPSS users working with R (not part of the question)

When I used to work with SPSS, I often used a combination of the IF and LAG commands to do some common tasks, such as slicing the data and keeping only the first row of each group. When I started working with R, I quickly learned that for those common tasks R usually comes with some handy functions, so there is no need to program your own routines with the lag function. And even for less common tasks, a little research often leads to a solution that does not iterate through the data.

In the end, the situations where I think “Well, I know how to do it in SPSS with the LAG command. But how could I do it in R?” are very rare. The dplyr package for R comes with a lag function, but it works differently, so the naive approach of substituting the SPSS LAG with the R lag does not work.

Difference between LAG from SPSS and dplyr::lag from R

Let’s say you have the following data with just one column:

library(dplyr)

df <- tibble(v1 = 1:6)

## # A tibble: 6 × 1
##      v1
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## 6     6

When computing a new variable in SPSS with LAG, SPSS processes the cases sequentially from top to bottom. The results calculated for the preceding rows can be used when computing the current row.

COMPUTE lagsum_spss = v1.
IF ($casenum > 1) lagsum_spss = lagsum_spss + LAG(lagsum_spss).

Which results in:

## # A tibble: 6 × 2
##      v1 lagsum_spss
##   <int>       <int>
## 1     1           1
## 2     2           3
## 3     3           6
## 4     4          10
## 5     5          15
## 6     6          21

The dplyr::lag function, on the other hand, is a vectorised function that applies the calculation to all elements of a vector simultaneously. So when I try to mimic the SPSS behavior in R with the mutate and lag functions, I get a different result:

df %>% 
  mutate(lagsum_r = v1,
         lagsum_r = lagsum_r + lag(lagsum_r, default = 0))

## # A tibble: 6 × 3
##      v1 lagsum_spss lagsum_r
##   <int>       <int>    <dbl>
## 1     1           1        1
## 2     2           3        3
## 3     3           6        5
## 4     4          10        7
## 5     5          15        9
## 6     6          21       11

The fourth row, for example, is calculated like this: lagsum_spss[4] = 4 + 6 and lagsum_r[4] = 4 + 3.
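To make the difference concrete, here is a minimal base R sketch (no dplyr needed). The variable names lagsum_vectorised and lagsum_sequential are only illustrative. The first calculation shifts the original v1 all at once, which is roughly what the mutate/lag attempt does; the second walks through the rows one by one and reuses the previous result, which is roughly what SPSS does.

```r
v1 <- 1:6

# Vectorised "lag": the whole vector is shifted at once, so each row only
# sees the ORIGINAL value of the row before it.
lagged <- c(0L, head(v1, -1))
lagsum_vectorised <- v1 + lagged

# Sequential processing: each row sees the already UPDATED value of the
# row before it, as SPSS does with LAG.
lagsum_sequential <- v1
for (i in seq_along(v1)[-1]) {
  lagsum_sequential[i] <- lagsum_sequential[i] + lagsum_sequential[i - 1]
}

lagsum_vectorised  # 1 3 5 7 9 11
lagsum_sequential  # 1 3 6 10 15 21
```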

So how can we reproduce this calculation in R? Well in this case it is quite simple:

df %>%
  mutate(cumsum = cumsum(v1))

## # A tibble: 6 × 3
##      v1 lagsum_spss cumsum
##   <int>       <int>  <int>
## 1     1           1      1
## 2     2           3      3
## 3     3           6      6
## 4     4          10     10
## 5     5          15     15
## 6     6          21     21

See, no need for lag this time.

OK, but what if I want to sum only the values from cases that meet a certain condition, a conditional cumsum if you will?

Example data set:

df <- tibble(v1 = 1:6, cond = c(1, 0, 1, 1, 0, 1))
df

## # A tibble: 6 × 2
##      v1  cond
##   <int> <dbl>
## 1     1     1
## 2     2     0
## 3     3     1
## 4     4     1
## 5     5     0
## 6     6     1

The SPSS code would look like this:

COMPUTE cond_cumsum = 0.
IF($casenum = 1 & cond = 1) cond_cumsum = v1.
IF($casenum > 1 & cond = 0) cond_cumsum = lag(cond_cumsum).
IF($casenum > 1 & cond = 1) cond_cumsum = lag(cond_cumsum) + v1.

So how can this be done in R? Well, the solution is also pretty easy:

df %>%
  mutate(cond_cumsum = cumsum(v1 * cond))

## # A tibble: 6 × 3
##      v1  cond cond_cumsum
##   <int> <dbl>       <dbl>
## 1     1     1           1
## 2     2     0           1
## 3     3     1           4
## 4     4     1           8
## 5     5     0           8
## 6     6     1          14
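As a rough cross-check, the SPSS logic can be written as an explicit loop in base R and compared with the vectorised version. This is only a sketch; the names cond_cumsum_loop and cond_cumsum_vec are made up for illustration.

```r
v1   <- 1:6
cond <- c(1, 0, 1, 1, 0, 1)

# Row-by-row version, mirroring the SPSS IF/LAG statements.
cond_cumsum_loop <- numeric(length(v1))
cond_cumsum_loop[1] <- if (cond[1] == 1) v1[1] else 0
for (i in seq_along(v1)[-1]) {
  cond_cumsum_loop[i] <- if (cond[i] == 1) {
    cond_cumsum_loop[i - 1] + v1[i]   # condition met: add the current value
  } else {
    cond_cumsum_loop[i - 1]           # condition not met: carry forward
  }
}

# Vectorised version: rows with cond = 0 contribute 0 to the running sum.
cond_cumsum_vec <- cumsum(v1 * cond)

cond_cumsum_loop  # 1 1 4 8 8 14
cond_cumsum_vec   # 1 1 4 8 8 14
```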

For a task where I think it is unavoidable to iterate through the data rows, see the question above.

CodePudding user response：

So what we want to do is basically this: start with the first two elements of a vector as input, do stuff with them, use that outcome as the first input and the next vector element as the second input, do the same stuff again, use that outcome as the new first input … and so on. If you have heard about the Reduce (base R) or the reduce and accumulate (purrr) functions, this may sound familiar.

The purrr Cheat Sheet contains an illustration of how the accumulate function works.
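In place of a picture, the idea can also be sketched with base R's Reduce and accumulate = TRUE, which keeps every intermediate result rather than only the final one. The variable names here are only illustrative.

```r
x <- 1:4

# Each step takes the previous outcome and the next element of x.
running_sum  <- Reduce(`+`, x, accumulate = TRUE)
running_prod <- Reduce(`*`, x, accumulate = TRUE)

running_sum   # 1 3 6 10
running_prod  # 1 2 6 24
```

With the built-in operators this is just cumsum and cumprod again; the point is that any two-argument function can be plugged in the same way.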

Let's first think about the function that we want to apply:

first we want to check if cond is 0 or 1
if cond is 1, then sum v1 from the current row with the outcome from the preceding step.
if not, then multiply v1 from the current row with the outcome from the preceding step.

So we program this function in R:

function(last_result, i) { # i stands for the row index.
  if (cond[i]) last_result + v1[i]
  else last_result * v1[i]
}

Now let's think about the first row, since there we have no “last_result” that we could pass into that function. Following the idea of a cumulative sum and a cumulative product, the value should be cumsum(v1[1]) if cond[1] is 1, or cumprod(v1[1]) if not. In both cases these functions return v1[1], so this is our initial value for the first row.

OK, now let's put this together for the accumulate function from the purrr package:

library(purrr)

df %>%
  mutate(
    cum_prodsum = accumulate(
      .x = row_number()[-1], # apply the function to all rows except the first one.
      .init = v1[1],         # initial value for the first row.
      .f = function(last_result, i) {
        if (cond[i]) last_result + v1[i]
        else last_result * v1[i]
      }
    ))

## # A tibble: 6 × 3
##      v1  cond cum_prodsum
##   <int> <dbl>       <int>
## 1     1     1           1
## 2     2     0           2
## 3     3     1           5
## 4     4     1           9
## 5     5     0          45
## 6     6     1          51

And that is it. For similar types of problems, where the calculation of a value depends on the calculations of the preceding values, just adjust the function within the accumulate command to your needs.
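If you prefer to stay in base R, the same idea can be sketched with Reduce, which was mentioned at the start of this answer. This is just one possible way to write it; the helper name step_fun is made up for illustration.

```r
v1   <- 1:6
cond <- c(1, 0, 1, 1, 0, 1)

# Same step function as above: add if cond is 1, multiply otherwise.
step_fun <- function(last_result, i) {
  if (cond[i] == 1) last_result + v1[i] else last_result * v1[i]
}

# Iterate over the row indices 2..n, starting from v1[1] as the initial value.
# accumulate = TRUE returns the initial value plus every intermediate result.
cum_prodsum <- Reduce(step_fun, seq_along(v1)[-1], accumulate = TRUE, v1[1])

cum_prodsum  # 1 2 5 9 45 51
```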

CodePudding user response：

library(dplyr)

df <- 
  tibble(v1 = as.numeric(1:6), v2 = c(1, 0, 1, 1, 0, 1))

df %>%
  mutate(output = case_when(v2 == 1 ~ cumsum(v1),
                         v2 == 0 ~ cumprod(v1)))
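A word of caution on the snippet above: cumsum(v1) and cumprod(v1) are each computed over the whole column independently, so the two running results never feed into each other. The following base R sketch uses ifelse as a stand-in for case_when (an assumption made only to avoid loading dplyr here) and compares the values with the desired result from the question.

```r
v1 <- as.numeric(1:6)
v2 <- c(1, 0, 1, 1, 0, 1)

# Pick, per row, either the plain cumulative sum or the plain cumulative product.
independent <- ifelse(v2 == 1, cumsum(v1), cumprod(v1))

# Desired result as stated in the question.
desired <- c(1, 2, 5, 9, 45, 51)

independent                      # 1 2 6 10 120 21
isTRUE(all.equal(independent, desired))  # FALSE
```

So this approach matches the desired output only for the first two rows of the example data.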