Identifying three distinct rows based on the same date

Time:01-26

I would like help writing code to handle three variables: a unique ID, some sample dates, and the results from each sample date. Let's call the sample results by three different names (e.g., NP1, NP2 and NP3). The unique ID can occur on multiple rows, since different sample results (NP1, NP2 and NP3) can be measured on different dates.

   IDrow     date result
1    ID1 01-01-09    NP1
2    ID1 01-01-10    NP1
3    ID1 01-01-10    NP2
4    ID1 01-01-10    NP3
5    ID1 03-03-15    NP1
6    ID1 03-03-15    NP2
7    ID1 03-03-15    NP3
8    ID2 01-05-10    NP1
9    ID2 01-05-10    NP2
10   ID2 01-05-10    NP3
11   ID3 02-08-11    NP1
12   ID3 02-08-11    NP2
13   ID3 02-08-11    NP3

As the table shows, ID1 has multiple measures, where NP1 was first measured in 2009, then again in 2010 and 2015 along with NP2 and NP3. My problem is getting to the following:

  IDrow     date result
1   ID1 01-01-10    NP1
2   ID1 01-01-10    NP2
3   ID1 01-01-10    NP3
4   ID2 01-05-10    NP1
5   ID2 01-05-10    NP2
6   ID2 01-05-10    NP3
7   ID3 02-08-11    NP1
8   ID3 02-08-11    NP2
9   ID3 02-08-11    NP3

I need NP1, NP2 and NP3 to occur together (or at least NP1 and NP2) on the earliest date possible. Thus the code needs to catch the earliest date on which NP1, NP2 and NP3 occur simultaneously and, if there is no such date, the earliest date on which NP1 and NP2 occur together.

I have searched stack for a similar problem, but couldn’t quite find a solution. I have been trying to wrangle the data with dplyr using

library(dplyr)

df <- df %>% 
  group_by(IDrow) %>%
  slice_min(date)

This returned some correct rows, but many were omitted because NP1 had been measured on its own before the first date on which NP1 and NP2/NP3 were measured together.

Afterwards I tried including "result" in group_by(),

df <- df %>% 
  group_by(IDrow, result) %>% 
  slice_min(date)

This returns the first occurrence of NP1 together with the later-occurring NP2 and NP3 for ID1, which is not what I want:

# A tibble: 9 × 3
# Groups:   IDrow, result [9]
  IDrow date     result
  <chr> <chr>    <chr> 
1 ID1   01-01-09 NP1   
2 ID1   01-01-10 NP2   
3 ID1   01-01-10 NP3   
4 ID2   01-05-10 NP1   
5 ID2   01-05-10 NP2   
6 ID2   01-05-10 NP3   
7 ID3   02-08-11 NP1   
8 ID3   02-08-11 NP2   
9 ID3   02-08-11 NP3

CodePudding user response:

Here's a two-step process. First, to make sure both steps are exercised, I'll remove one row so that at least one ID never has all three results on the same date:

quux <- structure(list(IDrow = c("ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID1", "ID2", "ID2", "ID2", "ID3", "ID3", "ID3"), date = c("01-01-09", "01-01-10", "01-01-10", "01-01-10", "03-03-15", "03-03-15", "03-03-15", "01-05-10", "01-05-10", "01-05-10", "02-08-11", "02-08-11", "02-08-11"), result = c("NP1", "NP1", "NP2", "NP3", "NP1", "NP2", "NP3", "NP1", "NP2", "NP3", "NP1", "NP2", "NP3")), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"))
quux <- quux[-10,] # artificially remove one row for this test

Now the solution:

quux3 <- quux %>%
  group_by(IDrow, date) %>%
  filter(all(c("NP1", "NP2", "NP3") %in% result)) %>%
  group_by(IDrow) %>%
  slice_min(date) %>%
  ungroup()
quux2 <- anti_join(quux, quux3, by = "IDrow") %>%
  group_by(IDrow, date) %>%
  filter(all(c("NP1", "NP2") %in% result)) %>%
  group_by(IDrow) %>%
  slice_min(date) %>%
  ungroup()
bind_rows(quux3, quux2)
# # A tibble: 8 × 3
#   IDrow date     result
#   <chr> <chr>    <chr> 
# 1 ID1   01-01-10 NP1   
# 2 ID1   01-01-10 NP2   
# 3 ID1   01-01-10 NP3   
# 4 ID3   02-08-11 NP1   
# 5 ID3   02-08-11 NP2   
# 6 ID3   02-08-11 NP3   
# 7 ID2   01-05-10 NP1   
# 8 ID2   01-05-10 NP2   

or this one-step:

quux %>%
  group_by(IDrow, date) %>%
  mutate(n32 = case_when(
    all(c("NP1", "NP2", "NP3") %in% result) ~ 3L,
    all(c("NP1", "NP2") %in% result) ~ 2L,
    TRUE ~ 0L
  )) %>%
  group_by(IDrow) %>%
  slice_max(n32) %>%
  slice_min(date) %>%
  ungroup() %>%
  select(-n32)
# # A tibble: 8 × 3
#   IDrow date     result
#   <chr> <chr>    <chr> 
# 1 ID1   01-01-10 NP1   
# 2 ID1   01-01-10 NP2   
# 3 ID1   01-01-10 NP3   
# 4 ID2   01-05-10 NP1   
# 5 ID2   01-05-10 NP2   
# 6 ID3   02-08-11 NP1   
# 7 ID3   02-08-11 NP2   
# 8 ID3   02-08-11 NP3   
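
One caveat: both pipelines above call slice_min() on date while it is still a character column, so the rows are compared as strings rather than as calendar dates. That happens to give the right rows for this sample, but it may not in general. Here is a minimal sketch, assuming the dates are month-day-year (as the mdy() call in the other answer suggests) and using a made-up ID9 that is not in the original data:

```r
library(dplyr)

# Hypothetical rows chosen so that string order and calendar order disagree.
toy <- data.frame(
  IDrow  = "ID9",
  date   = c("12-01-09", "01-01-10"),
  result = c("NP1", "NP1")
)

# Compared as strings, "01-01-10" sorts first even though it is the later date.
by_string <- toy %>% slice_min(date)

# Assumption: the format is month-day-two-digit-year.
by_date <- toy %>%
  mutate(date = as.Date(date, format = "%m-%d-%y")) %>%
  slice_min(date)

by_string$date        # the December 2009 row is missed
format(by_date$date)  # the December 2009 row is picked
```

If your dates are really day-month-year, swap the format string to "%d-%m-%y". Either way, converting date before the first group_by() should make both pipelines pick the calendar-earliest date.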

CodePudding user response:

I think this should work. First we turn your date column into an actual Date class so we can sort by it. Then I assign a "score" to each ID/date, which is a 2 if all 3 results are present, a 1 if just NP1 and NP2 are present, and 0 otherwise. We then drop the 0-score rows as you don't want to consider them and sort what remains by ID, descending score, and ascending date, and keep the first date within each ID based on that sort.

Your sample data doesn't include any ID that has only NP1 and NP2 on a date, but I think this will work for that case as well.

library(lubridate)
library(dplyr)

df %>%
  mutate(
    date = mdy(date)
  ) %>%
  group_by(IDrow, date) %>%
  mutate(
    score = case_when(
      n_distinct(result) == 3 ~ 2,
      "NP1" %in% result & "NP2" %in% result ~ 1,
      TRUE ~ 0
    )
  ) %>%
  filter(score > 0) %>%
  group_by(IDrow) %>%
  arrange(desc(score), date, .by_group = TRUE) %>%
  filter(date == first(date)) %>%
  ungroup()
# # A tibble: 9 × 4
#   IDrow date       result score
#   <chr> <date>     <chr>  <dbl>
# 1 ID1   2010-01-01 NP1        2
# 2 ID1   2010-01-01 NP2        2
# 3 ID1   2010-01-01 NP3        2
# 4 ID2   2010-01-05 NP1        2
# 5 ID2   2010-01-05 NP2        2
# 6 ID2   2010-01-05 NP3        2
# 7 ID3   2011-02-08 NP1        2
# 8 ID3   2011-02-08 NP2        2
# 9 ID3   2011-02-08 NP3        2
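
Note that n_distinct(result) == 3 only counts distinct labels, so it relies on result never containing anything other than NP1, NP2 and NP3. A small sketch of the difference, using a hypothetical NP4 label that does not appear in the question:

```r
library(dplyr)

# Hypothetical group: three distinct labels, but NP2 is missing.
result <- c("NP1", "NP3", "NP4")

count_check <- n_distinct(result) == 3                  # counts labels only
name_check  <- all(c("NP1", "NP2", "NP3") %in% result)  # checks the actual names

count_check  # TRUE
name_check   # FALSE
```

If other labels can occur in your data, the all(... %in% result) condition is probably the safer one to use in the first case_when() branch.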

Using this sample data:

df <- read.table(header = TRUE, text = '   IDrow     date result
1    ID1 01-01-09    NP1
2    ID1 01-01-10    NP1
3    ID1 01-01-10    NP2
4    ID1 01-01-10    NP3
5    ID1 03-03-15    NP1
6    ID1 03-03-15    NP2
7    ID1 03-03-15    NP3
8    ID2 01-05-10    NP1
9    ID2 01-05-10    NP2
10   ID2 01-05-10    NP3
11   ID3 02-08-11    NP1
12   ID3 02-08-11    NP2
13   ID3 02-08-11    NP3')
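
To check the NP1-and-NP2-only case that the sample data lacks, here is a rough sketch that rebuilds the same 13 rows, drops ID2's NP3 row, and reruns the scoring pipeline. It uses base as.Date() in place of lubridate::mdy() purely to keep the example dependency-light, and the name df_test is my own:

```r
library(dplyr)

# Same 13 rows as the sample data, built without read.table().
df <- data.frame(
  IDrow  = rep(c("ID1", "ID2", "ID3"), times = c(7, 3, 3)),
  date   = c("01-01-09", rep("01-01-10", 3), rep("03-03-15", 3),
             rep("01-05-10", 3), rep("02-08-11", 3)),
  result = c("NP1", rep(c("NP1", "NP2", "NP3"), 4))
)

# Drop row 10 (ID2's NP3) so that ID2 only ever has NP1 and NP2 together.
df_test <- df[-10, ]

res <- df_test %>%
  mutate(date = as.Date(date, format = "%m-%d-%y")) %>%  # assumes month-day-year
  group_by(IDrow, date) %>%
  mutate(score = case_when(
    n_distinct(result) == 3 ~ 2,
    "NP1" %in% result & "NP2" %in% result ~ 1,
    TRUE ~ 0
  )) %>%
  filter(score > 0) %>%
  group_by(IDrow) %>%
  arrange(desc(score), date, .by_group = TRUE) %>%
  filter(date == first(date)) %>%
  ungroup()

res
```

With that row removed, ID2 should come back with its NP1 and NP2 rows (score 1), while ID1 and ID3 keep all three rows (score 2).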