Home > Software engineering >  R: lagged "cumulative" difference between two values
R: lagged "cumulative" difference between two values

Time:11-10

I have a data.frame df with many groups (series) where the data area are presented annually. I'm trying to create a new column where diff is the difference between area in row 1 and row 2. But I need continue subtracting from the "new" difference. This needs to be done in descending order by year for each series.

df<-
structure(list(series = c("A218t23", "A218t23", "A218t23", "A218t23", 
"A218t23", "A218t23", "A218t23", "A218t23", "A218t23"), year = 2018:2010, 
    area = c(16409.3632611811, 274.5866082, 293.8540619, 323.0603775, 
    544.7366938, 108.0737561, 134.8579038, 143.14125, 167.8244576
    )), row.names = c(NA, -9L), groups = structure(list(series = "A218t23", 
    .rows = structure(list(1:9), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

My desired output would look like this:

So in the above data, I would ha

Hence, 16409-275=16135 then, 16135-294=15841 and so on.

The code I've been using:

df_diffs <- df %>%
   dplyr::group_by(series) %>%
   dplyr::mutate(diff = area - dplyr::lag(area, default=0, order_by = desc(year)))

However, this just returns the lagged difference between rows in the area column. The result I'm looking for is a "cumulative" or running difference. I've checked out RcppRoll and some other SO posts, but no luck. Ideally, I could keep this all within a piping framework since I have other functions going on. Bonus points if there's a way to replace the NA in the first row with the corresponding area value for that year.

Suggestions very much appreciated!

CodePudding user response:

Another option, using Reduce()


 df %>%
  group_by(series) %>%
  mutate(diff = Reduce("-", area, accumulate = T))

# A tibble: 9 × 4
# Groups:   series [1]
  series   year   area   diff
  <chr>   <int>  <dbl>  <dbl>
1 A218t23  2018 16409. 16409.
2 A218t23  2017   275. 16135.
3 A218t23  2016   294. 15841.
4 A218t23  2015   323. 15518.
5 A218t23  2014   545. 14973.
6 A218t23  2013   108. 14865.
7 A218t23  2012   135. 14730.
8 A218t23  2011   143. 14587.
9 A218t23  2010   168. 14419.

CodePudding user response:

Adapting this answer you could do:

library(dplyr)

df %>%
  dplyr::group_by(series) %>%
  dplyr::mutate(diff = c(area[1L], area[1L] - cumsum(area[-1L])))
#> # A tibble: 9 × 4
#> # Groups:   series [1]
#>   series   year   area   diff
#>   <chr>   <int>  <dbl>  <dbl>
#> 1 A218t23  2018 16409. 16409.
#> 2 A218t23  2017   275. 16135.
#> 3 A218t23  2016   294. 15841.
#> 4 A218t23  2015   323. 15518.
#> 5 A218t23  2014   545. 14973.
#> 6 A218t23  2013   108. 14865.
#> 7 A218t23  2012   135. 14730.
#> 8 A218t23  2011   143. 14587.
#> 9 A218t23  2010   168. 14419.

CodePudding user response:

You can achieve that with a tweaked cumulative sum.

Indeed, you start at the first value per group and then you subtract every value after. If you consider every value after the first one to be negative, the cumulative sum will be your expected output.

Here is the code:

library(tidyverse)
df = df %>% 
  mutate(series="A") %>% 
  bind_rows(df)


df %>% 
  group_by(series) %>% 
  mutate(
    x = ifelse(row_number()==1, area, -area),
    diff = cumsum(x)
  )
#> # A tibble: 18 x 5
#> # Groups:   series [2]
#>    series   year   area      x   diff
#>    <chr>   <int>  <dbl>  <dbl>  <dbl>
#>  1 A        2018 16409. 16409. 16409.
#>  2 A        2017   275.  -275. 16135.
#>  3 A        2016   294.  -294. 15841.
#>  4 A        2015   323.  -323. 15518.
#>  5 A        2014   545.  -545. 14973.
#>  6 A        2013   108.  -108. 14865.
#>  7 A        2012   135.  -135. 14730.
#>  8 A        2011   143.  -143. 14587.
#>  9 A        2010   168.  -168. 14419.
#> 10 A218t23  2018 16409. 16409. 16409.
#> 11 A218t23  2017   275.  -275. 16135.
#> 12 A218t23  2016   294.  -294. 15841.
#> 13 A218t23  2015   323.  -323. 15518.
#> 14 A218t23  2014   545.  -545. 14973.
#> 15 A218t23  2013   108.  -108. 14865.
#> 16 A218t23  2012   135.  -135. 14730.
#> 17 A218t23  2011   143.  -143. 14587.
#> 18 A218t23  2010   168.  -168. 14419.

Created on 2021-11-09 by the reprex package (v2.0.1)

CodePudding user response:

If you're working in the tidyverse you can use purrr::accumulate:

library(purrr)
library(dplyr)

df %>% 
  group_by(series) %>% 
  mutate(diff = accumulate(area, ~ .x - .y))

In the purrr function, .x is the current value and .y is the previous value.

Similar to the Reduce answer, you can pass it the arithmetic operator `-`: accumulate(area, `-`).

Output

# A tibble: 9 x 4
# Groups:   series [1]
  series   year   area   diff
  <chr>   <int>  <dbl>  <dbl>
1 A218t23  2018 16409. 16409.
2 A218t23  2017   275. 16135.
3 A218t23  2016   294. 15841.
4 A218t23  2015   323. 15518.
5 A218t23  2014   545. 14973.
6 A218t23  2013   108. 14865.
7 A218t23  2012   135. 14730.
8 A218t23  2011   143. 14587.
9 A218t23  2010   168. 14419.
  • Related