I have a df with two columns id
and value
:
structure(list(id = c("a", "a", "a", "a", "b", "b", "b", "b",
"b", "b", "b", "b", "b", "b", "b", "b"), value = c(8105.292,
45000, -41065, 45000, 46054.15, 50000, 50000, 50000, -180000,
50000, 50000, -180000, 50000, 50000, 50000, 50000)), row.names = c(NA,
-16L), class = "data.frame", sorted = c("x", "y"))
I need to modify the value
column and/or create a third column described by the following:
- if that row is the first occurrence in the table of the
id
, then returnvalue
- Otherwise, return the following:
lag(value) value
My problem is the correct calculation needs to reference the modified lag(value)
... the one that is currently being created. It needs to be applied in sequence, and not reference the original lag(value)
.
Does this make sense? I have tried applying a function via for loop but am not having luck, and think I am missing something simple. My expected output is:
structure(list(id = c("a", "a", "a", "a", "b", "b", "b", "b",
"b", "b", "b", "b", "b", "b", "b", "b"), value = c(8105.292,
45000, -41065, 45000, 46054.15, 50000, 50000, 50000, -180000,
50000, 50000, -180000, 50000, 50000, 50000, 50000), expectedvalue = c(8105.292,
53105.292, 12040.292, 57040.292, 46054.15, 96064.15, 146054.15,
196054.15, 16054.15, 66054.15, 116054.15, -64954.85, -13945.85,
36054.15, 86054.15, 136054.15)), row.names = c(NA, -16L), sorted = c("x",
"y"), class = "data.frame")
CodePudding user response:
It sounds like you want the cumulative sum within each id
. No need for loops.
library(dplyr)
df %>% group_by(id) %>% mutate(expectedValue = cumsum(value))
#> # A tibble: 16 x 3
#> # Groups: id [2]
#> id value expectedValue
#> <chr> <dbl> <dbl>
#> 1 a 8105. 8105.
#> 2 a 45000 53105.
#> 3 a -41065 12040.
#> 4 a 45000 57040.
#> 5 b 46054. 46054.
#> 6 b 50000 96054.
#> 7 b 50000 146054.
#> 8 b 50000 196054.
#> 9 b -180000 16054.
#> 10 b 50000 66054.
#> 11 b 50000 116054.
#> 12 b -180000 -63946.
#> 13 b 50000 -13946.
#> 14 b 50000 36054.
#> 15 b 50000 86054.
#> 16 b 50000 136054.
Created on 2022-03-08 by the reprex package (v2.0.1)
CodePudding user response:
Is this what you are looking for?
df <- structure(list(id = c("a", "a", "a", "a", "b", "b", "b", "b",
"b", "b", "b", "b", "b", "b", "b", "b"), value = c(8105.292,
45000, -41065, 45000, 46054.15, 50000, 50000, 50000, -180000,
50000, 50000, -180000, 50000, 50000, 50000, 50000)), row.names = c(NA,
-16L), class = "data.frame", sorted = c("x", "y"))
df %>% group_by(id) %>% mutate(value_ = value lag(value)) %>% mutate(value2 = ifelse(is.na(value_), value, value_)) %>% select(id, value, value2)
id value value2
<chr> <dbl> <dbl>
1 a 8105. 8105.
2 a 45000 53105.
3 a -41065 3935
4 a 45000 3935
5 b 46054. 46054.
6 b 50000 96054.
7 b 50000 100000
8 b 50000 100000
9 b -180000 -130000
10 b 50000 -130000
11 b 50000 100000
12 b -180000 -130000
13 b 50000 -130000
14 b 50000 100000
15 b 50000 100000
16 b 50000 100000
If you want to sort it before the group_by, function..
df %>% arrange(id, value) %>% group_by(id) %>% mutate(value_ = value lag(value)) %>% mutate(value2 = ifelse(is.na(value_), value, value_)) %>% select(id, value, value2)
id value value2
<chr> <dbl> <dbl>
1 a -41065 -41065
2 a 8105. -32960.
3 a 45000 53105.
4 a 45000 90000
5 b -180000 -180000
6 b -180000 -360000
7 b 46054. -133946.
8 b 50000 96054.
9 b 50000 100000
10 b 50000 100000
11 b 50000 100000
12 b 50000 100000
13 b 50000 100000
14 b 50000 100000
15 b 50000 100000
16 b 50000 100000