I'm struggling to figure out how to do the following in R. Imagine the following data set:
library(dplyr)  # provides tibble()

pdata <- tibble(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)
Basically, what I'm trying to do is add a calculated column that looks at the id in each row and sums value over that id's rows with an earlier time. For example, row 3 has id 1 and time 2018; id 1 has two records older than 2018 (2016 and 2017), each with value 1, so they add up to 2 and the new column is 2 for row 3. This is what I need the result to look like:
# A tibble: 50 × 4
      id  time value OUTPUT
   <int> <int> <dbl>  <dbl>
 1     1  2016     1      0
 2     1  2017     1      1
 3     1  2018     1      2
 4     1  2019     0      3
 5     1  2020     0      3
 6     2  2016     1      0
 7     2  2017     1      1
 8     2  2018     0      2
 9     2  2019     0      2
10     2  2020     0      2
# … with 40 more rows
Thank you!
CodePudding user response:
Sort by id and time, group by id, and compute the cumulative sum of the lagged values.
suppressPackageStartupMessages(library(dplyr))

pdata <- tibble(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)

pdata %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(OUTPUT = cumsum(lag(value, default = 0))) %>%
  ungroup()
#> # A tibble: 50 × 4
#>       id  time value OUTPUT
#>    <int> <int> <dbl>  <dbl>
#>  1     1  2016     1      0
#>  2     1  2017     1      1
#>  3     1  2018     1      2
#>  4     1  2019     0      3
#>  5     1  2020     0      3
#>  6     2  2016     1      0
#>  7     2  2017     1      1
#>  8     2  2018     0      2
#>  9     2  2019     0      2
#> 10     2  2020     0      2
#> # … with 40 more rows
Created on 2022-09-11 by the reprex package (v2.0.1)
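If you would rather avoid dplyr, the same idea can be sketched in base R. This is only a sketch under the assumption that each id has at most one row per time; the `cumsum(v) - v` trick is my own substitute for the lagged cumulative sum, not something from the question.

```r
# Base R sketch: running total of earlier values within each id.
pdata <- data.frame(
  id    = rep(1:10, each = 5),
  time  = rep(2016:2020, times = 10),
  value = c(1,1,1,0,0, 1,1,0,0,0, 0,0,1,0,0, 0,0,0,0,0, 1,0,0,0,1,
            0,1,1,1,0, 0,1,1,1,1, 1,1,1,1,1, 1,0,1,1,1, 1,1,0,1,1)
)

# Sort so that "earlier in time" means "higher up" within each id.
pdata <- pdata[order(pdata$id, pdata$time), ]

# cumsum(v) includes the current row, so subtracting v leaves only the prior rows.
pdata$OUTPUT <- ave(pdata$value, pdata$id, FUN = function(v) cumsum(v) - v)

head(pdata, 10)
```

ave() returns its result in the original row order, so the new column lines up with the sorted data frame without a merge step.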
CodePudding user response:
library(tidyverse)

df <- data.frame(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)
df1 <- df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  # sum value over the earlier rows of the same id (lag drops the current row)
  mutate(output = cumsum(lag(value, default = 0))) %>%
  ungroup()

Console:
      id  time value output
   <int> <int> <dbl>  <dbl>
 1     1  2016     1      0
 2     1  2017     1      1
 3     1  2018     1      2
 4     1  2019     0      3
 5     1  2020     0      3
 6     2  2016     1      0
 7     2  2017     1      1
 8     2  2018     0      2
 9     2  2019     0      2
10     2  2020     0      2
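To double-check any of the results, the rule from the question ("same id, time before this row") can be applied literally, row by row. The sketch below is slow and only meant as a cross-check; `prior_sum` is a hypothetical helper name, and unlike the lag-based versions it does not depend on row order.

```r
# Cross-check sketch: apply the question's rule directly to every row.
pdata <- data.frame(
  id    = rep(1:10, each = 5),
  time  = rep(2016:2020, times = 10),
  value = c(1,1,1,0,0, 1,1,0,0,0, 0,0,1,0,0, 0,0,0,0,0, 1,0,0,0,1,
            0,1,1,1,0, 0,1,1,1,1, 1,1,1,1,1, 1,0,1,1,1, 1,1,0,1,1)
)

# For row i, add up value over rows with the same id and a strictly earlier time.
prior_sum <- function(i) {
  same_id_earlier <- pdata$id == pdata$id[i] & pdata$time < pdata$time[i]
  sum(pdata$value[same_id_earlier])
}

pdata$OUTPUT <- vapply(seq_len(nrow(pdata)), prior_sum, numeric(1))

head(pdata, 10)
```

Because it compares every row against every other row, this scales quadratically, so it is better suited to verifying a small sample than to replacing the grouped cumulative sum.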