Adding a calculated column in R where each row is computed from a changing subset of the data

Time:09-11

I'm struggling to figure out how to do the following in R. Imagine the following data set:

library(tibble)

pdata <- tibble(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)

Basically, what I'm trying to do is add a calculated column that, for each row, sums the values of all rows with the same id and an earlier time. For example, in row 3 (id 1, time 2018) there are two records for id 1 older than 2018, each with value 1, so the new calculated column should be 2 for row 3. The following is what I need the result to look like:

# A tibble: 50 × 4
      id  time value OUTPUT
   <int> <int> <dbl>  <dbl>
 1     1  2016     1      0
 2     1  2017     1      1
 3     1  2018     1      2
 4     1  2019     0      3
 5     1  2020     0      3
 6     2  2016     1      0
 7     2  2017     1      1
 8     2  2018     0      2
 9     2  2019     0      2
10     2  2020     0      2
# … with 40 more rows

thank you!

CodePudding user response:

Sort by id and time, group by id and compute the cumulative sum of the lagged values.

suppressPackageStartupMessages(library(dplyr))

pdata <- tibble(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)

pdata %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(OUTPUT = cumsum(lag(value, default = 0))) %>%
  ungroup()
#> # A tibble: 50 × 4
#>       id  time value OUTPUT
#>    <int> <int> <dbl>  <dbl>
#>  1     1  2016     1      0
#>  2     1  2017     1      1
#>  3     1  2018     1      2
#>  4     1  2019     0      3
#>  5     1  2020     0      3
#>  6     2  2016     1      0
#>  7     2  2017     1      1
#>  8     2  2018     0      2
#>  9     2  2019     0      2
#> 10     2  2020     0      2
#> # … with 40 more rows

Created on 2022-09-11 by the reprex package (v2.0.1)
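
As a cross-check, the same column can be sketched in base R, assuming each id has at most one row per time value. The running total per id minus the current row's own value leaves the sum of the strictly earlier rows. The helper `sum_before` is a hypothetical name used only for this illustration; it applies the rule from the question literally and is only meant to verify the result, not to replace the dplyr pipeline above.

```r
# Same example data as in the question, built with base R only
pdata <- data.frame(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(1,1,1,0,0, 1,1,0,0,0, 0,0,1,0,0, 0,0,0,0,0, 1,0,0,0,1,
            0,1,1,1,0, 0,1,1,1,1, 1,1,1,1,1, 1,0,1,1,1, 1,1,0,1,1)
)

# Sort so that the running sum follows time within each id
pdata <- pdata[order(pdata$id, pdata$time), ]

# Running total per id, minus the current row's own value,
# leaves the sum over the strictly earlier rows of that id
pdata$OUTPUT <- ave(pdata$value, pdata$id, FUN = function(v) cumsum(v) - v)

# Brute-force check taken directly from the problem statement
# (hypothetical helper; quadratic in the number of rows)
sum_before <- function(i, d) {
  sum(d$value[d$id == d$id[i] & d$time < d$time[i]])
}
check <- vapply(seq_len(nrow(pdata)), sum_before, numeric(1), d = pdata)

# Stops with an error if the two computations disagree
stopifnot(all(pdata$OUTPUT == check))
head(pdata, 10)
```

If an id could have several rows with the same time, the two computations would differ, because the running-sum version counts earlier rows in sort order while the brute-force version counts only strictly earlier times.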

CodePudding user response:

library(tidyverse)

df <- data.frame(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)

df1 <- df %>% 
  arrange(id, time) %>% 
  group_by(id) %>% 
  # Sum `value` over the earlier rows of the same id
  mutate(output = cumsum(lag(value, default = 0))) %>% 
  ungroup()

Console:

      id  time value output
   <int> <int> <dbl>  <dbl>
 1     1  2016     1      0
 2     1  2017     1      1
 3     1  2018     1      2
 4     1  2019     0      3
 5     1  2020     0      3
 6     2  2016     1      0
 7     2  2017     1      1
 8     2  2018     0      2
 9     2  2019     0      2
10     2  2020     0      2