I would like help writing code to handle three variables: a unique ID, sample dates, and the results from each sample date. Let's call the sample results by three different names (e.g., NP1, NP2 and NP3). The unique ID can occur on multiple rows, since different sample results (NP1, NP2 and NP3) can be measured on different dates.
   IDrow date     result
1  ID1   01-01-09 NP1
2  ID1   01-01-10 NP1
3  ID1   01-01-10 NP2
4  ID1   01-01-10 NP3
5  ID1   03-03-15 NP1
6  ID1   03-03-15 NP2
7  ID1   03-03-15 NP3
8  ID2   01-05-10 NP1
9  ID2   01-05-10 NP2
10 ID2   01-05-10 NP3
11 ID3   02-08-11 NP1
12 ID3   02-08-11 NP2
13 ID3   02-08-11 NP3
As the table shows, ID1 has multiple measurements: NP1 was first measured on its own in 2009, then again in 2010 and 2015 together with NP2 and NP3. What I am trying to get to is the following:
  IDrow date     result
1 ID1   01-01-10 NP1
2 ID1   01-01-10 NP2
3 ID1   01-01-10 NP3
4 ID2   01-05-10 NP1
5 ID2   01-05-10 NP2
6 ID2   01-05-10 NP3
7 ID3   02-08-11 NP1
8 ID3   02-08-11 NP2
9 ID3   02-08-11 NP3
I need NP1, NP2 and NP3 to occur together (or at least NP1 and NP2) on the earliest date possible. Thus the code needs to catch when NP1, NP2 and NP3 occur simultaneously (on the earliest date), and if they never do, then NP1 and NP2 on the earliest date.
I have searched Stack Overflow for a similar problem, but couldn't quite find a solution. I have been trying to wrangle the data with dplyr using
library(dplyr)
df <- df %>%
  group_by(IDrow) %>%
  slice_min(date)
This gave me some correct rows, but many were omitted because NP1 was measured before the first date on which NP1 and NP2/NP3 were measured together.
Afterwards I tried including "result" in group_by(),
df <- df %>%
  group_by(IDrow, result) %>%
  slice_min(date)
This then prints the first occurrence of NP1 along with the later-occurring NP2 and NP3 for ID1:
# A tibble: 9 × 3
# Groups: IDrow, result [9]
  IDrow date     result
  <chr> <chr>    <chr>
1 ID1   01-01-09 NP1
2 ID1   01-01-10 NP2
3 ID1   01-01-10 NP3
4 ID2   01-05-10 NP1
5 ID2   01-05-10 NP2
6 ID2   01-05-10 NP3
7 ID3   02-08-11 NP1
8 ID3   02-08-11 NP2
9 ID3   02-08-11 NP3
CodePudding user response:
Here's a two-step process. First, to make sure we get both steps, I'll remove one row so that at least one group does not find all 3:
quux <- structure(list(IDrow = c("ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID2", "ID2", "ID2", "ID3", "ID3", "ID3"), date = c("01-01-09", "01-01-10", "01-01-10", "01-01-10", "03-03-15", "03-03-15", "03-03-15", "01-05-10", "01-05-10", "01-05-10", "02-08-11", "02-08-11", "02-08-11"), result = c("NP1", "NP1", "NP2", "NP3", "NP1", "NP2", "NP3", "NP1", "NP2", "NP3", "NP1", "NP2", "NP3")), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"))
quux <- quux[-10,] # artificially remove one row for this test
Now the solution:
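# Step 1: per ID, the earliest date on which NP1, NP2 and NP3 all occur
quux3 <- quux %>%
  group_by(IDrow, date) %>%
  filter(all(c("NP1", "NP2", "NP3") %in% result)) %>%
  group_by(IDrow) %>%
  slice_min(date) %>%
  ungroup()
# Step 2: for IDs with no match in step 1, the earliest date with NP1 and NP2
quux2 <- anti_join(quux, quux3, by = "IDrow") %>%
  group_by(IDrow, date) %>%
  filter(all(c("NP1", "NP2") %in% result)) %>%
  group_by(IDrow) %>%
  slice_min(date) %>%
  ungroup()
# Combine both sets of rows
bind_rows(quux3, quux2)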
# # A tibble: 8 × 3
# IDrow date result
# <chr> <chr> <chr>
# 1 ID1 01-01-10 NP1
# 2 ID1 01-01-10 NP2
# 3 ID1 01-01-10 NP3
# 4 ID3 02-08-11 NP1
# 5 ID3 02-08-11 NP2
# 6 ID3 02-08-11 NP3
# 7 ID2 01-05-10 NP1
# 8 ID2 01-05-10 NP2
Or this one-step version:
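quux %>%
  group_by(IDrow, date) %>%
  # n32: 3 if this date has all three results, 2 if it has NP1 and NP2, else 0
  mutate(n32 = case_when(
    all(c("NP1", "NP2", "NP3") %in% result) ~ 3L,
    all(c("NP1", "NP2") %in% result) ~ 2L,
    TRUE ~ 0L
  )) %>%
  filter(n32 > 0L) %>%  # drop dates that have neither combination
  group_by(IDrow) %>%
  slice_max(n32) %>%
  slice_min(date) %>%
  ungroup() %>%
  select(-n32)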
# # A tibble: 8 × 3
# IDrow date result
# <chr> <chr> <chr>
# 1 ID1 01-01-10 NP1
# 2 ID1 01-01-10 NP2
# 3 ID1 01-01-10 NP3
# 4 ID2 01-05-10 NP1
# 5 ID2 01-05-10 NP2
# 6 ID3 02-08-11 NP1
# 7 ID3 02-08-11 NP2
# 8 ID3 02-08-11 NP3
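One caveat for both versions above: date is still a character column, so slice_min(date) compares the strings rather than calendar dates. That happens to give the right answer for the sample data, but it need not in general. A minimal base R sketch of the difference, assuming the dates are month-day-year (the question does not say; the two example dates here are made up for illustration):

```r
# Dates stored as character strings, assumed to be month-day-year.
dates <- c("12-01-09", "01-01-10")

# String comparison goes character by character, so "01-..." sorts before "12-...".
earliest_as_text <- min(dates)

# Parsing to Date makes the comparison use actual calendar dates.
parsed <- as.Date(dates, format = "%m-%d-%y")
earliest_as_date <- min(parsed)

print(earliest_as_text)  # "01-01-10" (January 2010, not the earliest sample)
print(earliest_as_date)  # 2009-12-01
```

With the data from the question, the conversion would go before the grouping, for example mutate(date = as.Date(date, format = "%m-%d-%y")). If the dates are actually day-month-year, the format string would be "%d-%m-%y" instead.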
CodePudding user response:
I think this should work. First we turn your date column into an actual Date class so we can sort by it (mdy() assumes the strings are month-day-year). Then I assign a "score" to each ID/date combination: 2 if all 3 results are present, 1 if just NP1 and NP2 are present, and 0 otherwise. We then drop the 0-score rows, since you don't want to consider them, sort what remains by ID, descending score, and ascending date, and keep the first date within each ID based on that sort.
Your sample data doesn't include any IDs that have just NP1 and NP2, but I think this will work for that case as well.
library(lubridate)
library(dplyr)
df %>%
  mutate(
    date = mdy(date)  # parse month-day-year strings into Date
  ) %>%
  group_by(IDrow, date) %>%
  mutate(
    score = case_when(
      # assumes NP1, NP2 and NP3 are the only values in result
      n_distinct(result) == 3 ~ 2,
      "NP1" %in% result & "NP2" %in% result ~ 1,
      TRUE ~ 0
    )
  ) %>%
  filter(score > 0) %>%
  group_by(IDrow) %>%
  # best score first, then earliest date, within each ID
  arrange(desc(score), date, .by_group = TRUE) %>%
  filter(date == first(date)) %>%
  ungroup()
# # A tibble: 9 × 4
# IDrow date result score
# <chr> <date> <chr> <dbl>
# 1 ID1 2010-01-01 NP1 2
# 2 ID1 2010-01-01 NP2 2
# 3 ID1 2010-01-01 NP3 2
# 4 ID2 2010-01-05 NP1 2
# 5 ID2 2010-01-05 NP2 2
# 6 ID2 2010-01-05 NP3 2
# 7 ID3 2011-02-08 NP1 2
# 8 ID3 2011-02-08 NP2 2
# 9 ID3 2011-02-08 NP3 2
Using this sample data:
df <- read.table(header = TRUE, text = ' IDrow date result
1 ID1 01-01-09 NP1
2 ID1 01-01-10 NP1
3 ID1 01-01-10 NP2
4 ID1 01-01-10 NP3
5 ID1 03-03-15 NP1
6 ID1 03-03-15 NP2
7 ID1 03-03-15 NP3
8 ID2 01-05-10 NP1
9 ID2 01-05-10 NP2
10 ID2 01-05-10 NP3
11 ID3 02-08-11 NP1
12 ID3 02-08-11 NP2
13 ID3 02-08-11 NP3')
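Since the sample data never exercises the NP1/NP2-only fallback, one way to check it is to drop row 10 (ID2's NP3) and rerun the same pipeline. This is only a sketch: it rebuilds the sample data with data.frame(), uses base as.Date() with an assumed month-day-year format in place of mdy(), and the names df_test and out are made up here.

```r
library(dplyr)

# Same 13 rows as the sample data above
df <- data.frame(
  IDrow  = c(rep("ID1", 7), rep("ID2", 3), rep("ID3", 3)),
  date   = c("01-01-09", rep("01-01-10", 3), rep("03-03-15", 3),
             rep("01-05-10", 3), rep("02-08-11", 3)),
  result = c("NP1", rep(c("NP1", "NP2", "NP3"), 4))
)

# Hypothetical test case: remove ID2's NP3 so ID2 only has NP1 and NP2
df_test <- df[-10, ]

out <- df_test %>%
  mutate(date = as.Date(date, format = "%m-%d-%y")) %>%  # assumed month-day-year
  group_by(IDrow, date) %>%
  mutate(
    score = case_when(
      n_distinct(result) == 3 ~ 2,
      "NP1" %in% result & "NP2" %in% result ~ 1,
      TRUE ~ 0
    )
  ) %>%
  filter(score > 0) %>%
  group_by(IDrow) %>%
  arrange(desc(score), date, .by_group = TRUE) %>%
  filter(date == first(date)) %>%
  ungroup()

print(out)
```

ID1 and ID3 should still come back with three rows each and a score of 2, while ID2 should come back with its NP1 and NP2 rows and a score of 1.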