Creating new variable based on multiple conditions and comparisons within group

Time:08-12

I have data roughly like this (the full data as dput output and the desired results are below):

   id    date       u         v
   <chr> <date>     <chr> <int>
 1 a     2019-05-14 NA        0
 2 a     2018-06-29 u         1
 3 b     2020-12-02 u         1
 4 b     2017-08-16 NA        1
 5 b     2016-04-07 NA        0
 6 c     2018-05-22 u         1
 7 c     2018-05-22 u         1
 8 e     2019-03-06 u         1
 9 e     2019-03-06 NA        1

I am trying to create a new variable pr that identifies, for each row where u == "u", whether the same id group contains another row with v == 1 on an equal or earlier date (regardless of that other row's value of u).

I know in general how to create a new variable based on in-group conditions:

library(dplyr)
x %>% 
  group_by(id) %>% 
  mutate(pr = case_when())

But I can't figure out how to compare the other dates within the group to the date of the u row, or how to detect v == 1 on those other rows without counting the u row I am using as the reference. Note that a u row will always have v == 1.

Expected output is:

   id    date       u         v    pr
   <chr> <date>     <chr> <int> <int>
 1 a     2019-05-14 NA        0    NA
 2 a     2018-06-29 u         1     0
 3 b     2020-12-02 u         1     1
 4 b     2017-08-16 NA        1    NA
 5 b     2016-04-07 NA        0    NA
 6 c     2018-05-22 u         1     1
 7 c     2018-05-22 u         1     1
 8 e     2019-03-06 u         1     1
 9 e     2019-03-06 NA        1    NA
10 f     2020-10-20 u         1     0
11 f     2019-01-25 NA        0    NA
12 h     2020-02-24 NA        0    NA
13 h     2018-10-15 u         1     0
14 h     2018-03-07 NA        0    NA
15 i     2021-02-02 u         1     1
16 i     2020-11-19 NA        1    NA
17 i     2020-11-19 NA        1    NA
18 j     2019-02-11 u         1     1
19 j     2017-06-26 u         1     0
20 k     2018-12-13 u         1     0
21 k     2017-07-18 NA        0    NA
22 l     2018-05-08 u         1     1
23 l     2018-02-15 NA        0    NA
24 l     2018-02-15 u         1     0
25 l     2017-11-07 NA        0    NA
26 l     2015-09-10 NA        0    NA

The format of the variables isn't ideal; if there's any way for me to help clean it up let me know. Actual data is sensitive so I'm approximating.

> dput(x)
structure(list(id = c("a", "a", "b", "b", "b", "c", "c", "e", 
"e", "f", "f", "h", "h", "h", "i", "i", "i", "j", "j", "k", "k", 
"l", "l", "l", "l", "l"), date = structure(c(18030, 17711, 18598, 
17394, 16898, 17673, 17673, 17961, 17961, 18555, 17921, 18316, 
17819, 17597, 18660, 18585, 18585, 17938, 17343, 17878, 17365, 
17659, 17577, 17577, 17477, 16688), class = "Date"), u = c(NA, 
"u", "u", NA, NA, "u", "u", "u", NA, "u", NA, NA, "u", NA, "u", 
NA, NA, "u", "u", "u", NA, "u", NA, "u", NA, NA), v = c(0L, 1L, 
1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 
1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L), pr = c(NA, 0L, 1L, NA, NA, 1L, 
1L, 1L, NA, 0L, NA, NA, 0L, NA, 1L, NA, NA, 1L, 0L, 0L, NA, 1L, 
NA, 0L, NA, NA)), row.names = c(NA, -26L), class = c("tbl_df", 
"tbl", "data.frame"))

CodePudding user response:

We may create a function

library(dplyr)
library(purrr)
f1 <- function(u, v, date) {
      # start with a vector of 0s, one per row of the group
      # (length(u) also works when f1 is called outside a dplyr verb)
      tmp <- rep(0, length(u))
      # logical vectors: rows where `u` is "u", and rows where `v` is 1
      i1 <- u %in% "u"
      i2 <- v %in% 1
      # for each row with v == 1, check whether its date is on or after
      # `all` of the dates in the group where `u` is "u";
      # the result only counts when the group has more than one v == 1 row
      tmp[i2] <- purrr::map_lgl(date[i2],
           ~ all(.x >= date[i1])) & sum(i2) > 1
      # rows where `u` is NA get NA
      tmp[is.na(u)] <- NA
      tmp
      }

and apply it after grouping

x1 <- x %>%
   group_by(id) %>%
   mutate(prnew = f1(u, v, date)) %>% 
   ungroup()
> all.equal(x1$pr, x1$prnew)
[1] TRUE
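
For comparison, here is a minimal sketch that encodes the rule from the question literally: for each row where u is "u", look at every other row of the group and check for v == 1 on an equal or earlier date. The names pr_direct and x_small are hypothetical and are not part of the answer above. As far as I can tell, f1 reproduces pr on the sample data but would also return 1 for a u row whose only other v == 1 row has a later date, so a direct version may be safer on real data. It uses base R only, under the assumption that groups are small enough for a per-row loop.

```r
# Hypothetical helper (not from the answer above): literal version of the rule.
pr_direct <- function(u, v, date) {
  n <- length(u)
  out <- rep(NA_integer_, n)            # rows where u is NA stay NA
  for (i in which(u %in% "u")) {
    others <- setdiff(seq_len(n), i)    # every row except the reference row
    hit <- v[others] %in% 1 & date[others] <= date[i]
    out[i] <- as.integer(any(hit, na.rm = TRUE))
  }
  out
}

# Small illustrative data set (groups j and l from the question)
x_small <- data.frame(
  id   = c("j", "j", "l", "l", "l"),
  date = as.Date(c("2019-02-11", "2017-06-26",
                   "2018-05-08", "2018-02-15", "2018-02-15")),
  u    = c("u", "u", "u", NA, "u"),
  v    = c(1L, 1L, 1L, 0L, 1L)
)

# Apply per id group by splitting the row indices with ave()
x_small$pr_direct <- ave(
  seq_len(nrow(x_small)), x_small$id,
  FUN = function(idx) pr_direct(x_small$u[idx], x_small$v[idx], x_small$date[idx])
)
x_small
```

Since pr_direct takes the same three vectors as f1, it should also work inside the grouped mutate() shown above, e.g. mutate(prnew = pr_direct(u, v, date)).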