I have a set of data roughly like this (more data in dput & desired results below):
  id    date       u         v
  <chr> <date>     <chr> <int>
1 a     2019-05-14 NA        0
2 a     2018-06-29 u         1
3 b     2020-12-02 u         1
4 b     2017-08-16 NA        1
5 b     2016-04-07 NA        0
6 c     2018-05-22 u         1
7 c     2018-05-22 u         1
8 e     2019-03-06 u         1
9 e     2019-03-06 NA        1
I am trying to create a new variable pr identifying, for each id, whether, when u == "u", there is an equal or earlier date where v == 1 within that id group (regardless of the value of u).
I know how, in general, to create a new variable based on in-group conditions:
library(dplyr)
x %>%
  group_by(id) %>%
  mutate(pr = case_when())
But I can't figure out how to compare the other dates within the group to the date of the u row, or how to detect v == 1 on a row other than the u row I am using as the reference. Note that a u row will always have v == 1.
Expected output is:
   id    date       u         v    pr
   <chr> <date>     <chr> <int> <int>
 1 a     2019-05-14 NA        0    NA
 2 a     2018-06-29 u         1     0
 3 b     2020-12-02 u         1     1
 4 b     2017-08-16 NA        1    NA
 5 b     2016-04-07 NA        0    NA
 6 c     2018-05-22 u         1     1
 7 c     2018-05-22 u         1     1
 8 e     2019-03-06 u         1     1
 9 e     2019-03-06 NA        1    NA
10 f     2020-10-20 u         1     0
11 f     2019-01-25 NA        0    NA
12 h     2020-02-24 NA        0    NA
13 h     2018-10-15 u         1     0
14 h     2018-03-07 NA        0    NA
15 i     2021-02-02 u         1     1
16 i     2020-11-19 NA        1    NA
17 i     2020-11-19 NA        1    NA
18 j     2019-02-11 u         1     1
19 j     2017-06-26 u         1     0
20 k     2018-12-13 u         1     0
21 k     2017-07-18 NA        0    NA
22 l     2018-05-08 u         1     1
23 l     2018-02-15 NA        0    NA
24 l     2018-02-15 u         1     0
25 l     2017-11-07 NA        0    NA
26 l     2015-09-10 NA        0    NA
The format of the variables isn't ideal; if there's any way for me to help clean it up let me know. Actual data is sensitive so I'm approximating.
> dput(x)
structure(list(id = c("a", "a", "b", "b", "b", "c", "c", "e",
"e", "f", "f", "h", "h", "h", "i", "i", "i", "j", "j", "k", "k",
"l", "l", "l", "l", "l"), date = structure(c(18030, 17711, 18598,
17394, 16898, 17673, 17673, 17961, 17961, 18555, 17921, 18316,
17819, 17597, 18660, 18585, 18585, 17938, 17343, 17878, 17365,
17659, 17577, 17577, 17477, 16688), class = "Date"), u = c(NA,
"u", "u", NA, NA, "u", "u", "u", NA, "u", NA, NA, "u", NA, "u",
NA, NA, "u", "u", "u", NA, "u", NA, "u", NA, NA), v = c(0L, 1L,
1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L), pr = c(NA, 0L, 1L, NA, NA, 1L,
1L, 1L, NA, 0L, NA, NA, 0L, NA, 1L, NA, NA, 1L, 0L, 0L, NA, 1L,
NA, 0L, NA, NA)), row.names = c(NA, -26L), class = c("tbl_df",
"tbl", "data.frame"))
CodePudding user response:
We may create a function
library(dplyr)
library(purrr)
f1 <- function(u, v, date) {
  # start with a vector of 0s, one per row in the group
  # (length(u) rather than n(), so the function also works outside dplyr verbs)
  tmp <- rep(0, length(u))
  # logical vectors: rows where `u` is "u", and rows where `v` is 1
  i1 <- u %in% "u"
  i2 <- v %in% 1
  # for each date where v is 1, check whether it is greater than or equal to
  # `all` of the dates where `u` is "u", and require more than one row with
  # v equal to 1 in the group; assign the result to 'tmp' where v is 1
  tmp[i2] <- purrr::map_lgl(date[i2], ~ all(.x >= date[i1])) & sum(i2) > 1
  # set NA where `u` is NA, then return 'tmp'
  tmp[is.na(u)] <- NA
  tmp
}
and apply it after grouping
x1 <- x %>%
  group_by(id) %>%
  mutate(prnew = f1(u, v, date)) %>%
  ungroup()
> all.equal(x1$pr, x1$prnew)
[1] TRUE
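As a cross-check, the rule stated in the question can also be written out row by row. The following is only a sketch under stated assumptions: `prior_v`, `x_small`, and `pr2` are hypothetical names introduced here for illustration (they appear in neither the question nor the answer), and `x_small` is a small subset of the posted data (groups a, b and j). For each row where u is "u", it asks whether any other row in the same group has v == 1 and a date on or before that row's date.

```r
library(dplyr)

# Hypothetical helper (an illustration, not part of the answer above):
# for each row i where u is "u", return 1 if any *other* row in the group
# has v == 1 and a date on or before date[i], and 0 otherwise.
# Rows where u is not "u" (including NA) get NA.
prior_v <- function(u, v, date) {
  idx <- seq_along(date)
  out <- vapply(idx, function(i) {
    others <- idx != i  # exclude the reference row itself
    as.integer(any(v[others] == 1 & date[others] <= date[i], na.rm = TRUE))
  }, integer(1))
  out[!(u %in% "u")] <- NA_integer_
  out
}

# Small subset of the posted data (groups a, b and j), rebuilt by hand
x_small <- tibble(
  id   = c("a", "a", "b", "b", "b", "j", "j"),
  date = as.Date(c("2019-05-14", "2018-06-29", "2020-12-02", "2017-08-16",
                   "2016-04-07", "2019-02-11", "2017-06-26")),
  u    = c(NA, "u", "u", NA, NA, "u", "u"),
  v    = c(0L, 1L, 1L, 1L, 0L, 1L, 1L)
)

res <- x_small %>%
  group_by(id) %>%
  mutate(pr2 = prior_v(u, v, date)) %>%
  ungroup()
```

This version compares every row against every other row in its group, so it is quadratic in group size; that should be fine for small groups but may be slow for very large ones. Unlike f1, it evaluates each u row against its own date rather than against all u dates in the group, so the two can disagree on data shaped differently from the example.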