I have a data.frame df
with many groups (series
) where the data area
are presented annually. I'm trying to create a new column where diff
is the difference between area in row 1 and row 2. But I need continue subtracting from the "new" difference. This needs to be done in descending order by year
for each series
.
df<-
structure(list(series = c("A218t23", "A218t23", "A218t23", "A218t23",
"A218t23", "A218t23", "A218t23", "A218t23", "A218t23"), year = 2018:2010,
area = c(16409.3632611811, 274.5866082, 293.8540619, 323.0603775,
544.7366938, 108.0737561, 134.8579038, 143.14125, 167.8244576
)), row.names = c(NA, -9L), groups = structure(list(series = "A218t23",
.rows = structure(list(1:9), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
My desired output would look like this:
Hence, 16409
-275
=16135
then, 16135
-294
=15841
and so on.
The code I've been using:
df_diffs <- df %>%
dplyr::group_by(series) %>%
dplyr::mutate(diff = area - dplyr::lag(area, default=0, order_by = desc(year)))
However, this just returns the lagged difference between rows in the area
column. The result I'm looking for is a "cumulative" or running difference. I've checked out RcppRoll
and some other SO posts, but no luck. Ideally, I could keep this all within a piping framework since I have other functions going on. Bonus points if there's a way to replace the NA in the first row with the corresponding area value for that year.
Suggestions very much appreciated!
CodePudding user response:
Another option, using Reduce()
df %>%
group_by(series) %>%
mutate(diff = Reduce("-", area, accumulate = T))
# A tibble: 9 × 4
# Groups: series [1]
series year area diff
<chr> <int> <dbl> <dbl>
1 A218t23 2018 16409. 16409.
2 A218t23 2017 275. 16135.
3 A218t23 2016 294. 15841.
4 A218t23 2015 323. 15518.
5 A218t23 2014 545. 14973.
6 A218t23 2013 108. 14865.
7 A218t23 2012 135. 14730.
8 A218t23 2011 143. 14587.
9 A218t23 2010 168. 14419.
CodePudding user response:
Adapting this answer you could do:
library(dplyr)
df %>%
dplyr::group_by(series) %>%
dplyr::mutate(diff = c(area[1L], area[1L] - cumsum(area[-1L])))
#> # A tibble: 9 × 4
#> # Groups: series [1]
#> series year area diff
#> <chr> <int> <dbl> <dbl>
#> 1 A218t23 2018 16409. 16409.
#> 2 A218t23 2017 275. 16135.
#> 3 A218t23 2016 294. 15841.
#> 4 A218t23 2015 323. 15518.
#> 5 A218t23 2014 545. 14973.
#> 6 A218t23 2013 108. 14865.
#> 7 A218t23 2012 135. 14730.
#> 8 A218t23 2011 143. 14587.
#> 9 A218t23 2010 168. 14419.
CodePudding user response:
You can achieve that with a tweaked cumulative sum.
Indeed, you start at the first value per group and then you subtract every value after. If you consider every value after the first one to be negative, the cumulative sum will be your expected output.
Here is the code:
library(tidyverse)
df = df %>%
mutate(series="A") %>%
bind_rows(df)
df %>%
group_by(series) %>%
mutate(
x = ifelse(row_number()==1, area, -area),
diff = cumsum(x)
)
#> # A tibble: 18 x 5
#> # Groups: series [2]
#> series year area x diff
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 A 2018 16409. 16409. 16409.
#> 2 A 2017 275. -275. 16135.
#> 3 A 2016 294. -294. 15841.
#> 4 A 2015 323. -323. 15518.
#> 5 A 2014 545. -545. 14973.
#> 6 A 2013 108. -108. 14865.
#> 7 A 2012 135. -135. 14730.
#> 8 A 2011 143. -143. 14587.
#> 9 A 2010 168. -168. 14419.
#> 10 A218t23 2018 16409. 16409. 16409.
#> 11 A218t23 2017 275. -275. 16135.
#> 12 A218t23 2016 294. -294. 15841.
#> 13 A218t23 2015 323. -323. 15518.
#> 14 A218t23 2014 545. -545. 14973.
#> 15 A218t23 2013 108. -108. 14865.
#> 16 A218t23 2012 135. -135. 14730.
#> 17 A218t23 2011 143. -143. 14587.
#> 18 A218t23 2010 168. -168. 14419.
Created on 2021-11-09 by the reprex package (v2.0.1)
CodePudding user response:
If you're working in the tidyverse you can use purrr::accumulate
:
library(purrr)
library(dplyr)
df %>%
group_by(series) %>%
mutate(diff = accumulate(area, ~ .x - .y))
In the purrr function, .x
is the current value and .y
is the previous value.
Similar to the Reduce
answer, you can pass it the arithmetic operator `-`
: accumulate(area, `-`)
.
Output
# A tibble: 9 x 4
# Groups: series [1]
series year area diff
<chr> <int> <dbl> <dbl>
1 A218t23 2018 16409. 16409.
2 A218t23 2017 275. 16135.
3 A218t23 2016 294. 15841.
4 A218t23 2015 323. 15518.
5 A218t23 2014 545. 14973.
6 A218t23 2013 108. 14865.
7 A218t23 2012 135. 14730.
8 A218t23 2011 143. 14587.
9 A218t23 2010 168. 14419.