r nested indicator 1st of 1st and 2nd of 1st and 2nd of 2nd-CodePudding

I have a dataset with bunch of repeated observation like this.

  Id    Date        Group  Diagnosis    
  1     8/16/2004   Red    A
  1     8/16/2004   Red    B
  1     8/16/2004   Red    C

  2     4/23/2010   Blue    A
  2     4/23/2010   Blue    C

  3     5/13/2006   Blue    A
  3     5/13/2006   Blue    B
  3     5/13/2006   Blue    C
  3     6/05/2011   Blue    A
  3     6/05/2011   Blue    B
  3     6/05/2011   Blue    C

  4     10/06/2009   Blue    A
  4     10/06/2009   Blue    B
  4     10/06/2009   Blue    C
  4     7/22/2010    Blue    A
  4     7/22/2010    Blue    B

I like to create an new indicator value that tracks which observation have only one set of observations and which have two . Among those who have two sets of observations the indcator should indicate which is the 1st and 2nd so on.

Expected output

      Id    Date        Group  Diagnosis   I   
      1     8/16/2004   Red    A           1-1
      1     8/16/2004   Red    B           1-1
      1     8/16/2004   Red    C           1-1

      2     4/23/2010   Blue    A          1-1
      2     4/23/2010   Blue    C          1-1

      3     5/13/2006   Blue    A          2-1
      3     5/13/2006   Blue    B          2-1      
      3     5/13/2006   Blue    C          2-1 
      3     6/05/2011   Blue    A          2-2 
      3     6/05/2011   Blue    B          2-2
      3     6/05/2011   Blue    C          2-2

      4     10/06/2009   Blue    A         2-1
      4     10/06/2009   Blue    B         2-1
      4     10/06/2009   Blue    C         2-1
      4     7/22/2010    Blue    A         2-2
      4     7/22/2010    Blue    B         2-2

Values in column I is 1-1 for Id 1 and Id 2 because both these Ids have only once set of Diagnosis, Id 1 evaluated on 8/16/2004 and Id 2 evaluated on 4/23/2010

Values in column I is 2-1 for Id 3 and Id 4 because both these Ids have two sets of Diagnosis, Id 3 evaluated on 5/13/2006 and 6/05/2011, Id 4 evaluated on 10/06/2009 and 7/22/2010. The first set of observations for Id 3 and Id 4 were on 5/13/2006 and 10/06/2009 so 2-1. The second set of observations for Id 3 and Id 4 were on 6/05/2011 and 7/22/2010 so 2-2

I tried n() and n_distinct using group_by( Id,Group, Diagnosis) but this not working. So any suggestions or help is much appreciated. Thanks.

CodePudding user response：

We can do:

library(dplyr)

df |> 
  group_by(Id) |> 
  mutate(I = paste0(n_distinct(Date), "-", match(Date, unique(Date)))) |> 
  ungroup()

Using paste0 to paste the number of distinct dates in the group and match to match the date with the first match

      Id Date       Group Diagnosis I    
   <int> <chr>      <chr> <chr>     <chr>
 1     1 8/16/2004  Red   A         1-1  
 2     1 8/16/2004  Red   B         1-1  
 3     1 8/16/2004  Red   C         1-1  
 4     2 4/23/2010  Blue  A         1-1  
 5     2 4/23/2010  Blue  C         1-1  
 6     3 5/13/2006  Blue  A         2-1  
 7     3 5/13/2006  Blue  B         2-1  
 8     3 5/13/2006  Blue  C         2-1  
 9     3 6/05/2011  Blue  A         2-2  
10     3 6/05/2011  Blue  B         2-2  
11     3 6/05/2011  Blue  C         2-2  
12     4 10/06/2009 Blue  A         2-1  
13     4 10/06/2009 Blue  B         2-1  
14     4 10/06/2009 Blue  C         2-1  
15     4 7/22/2010  Blue  A         2-2  
16     4 7/22/2010  Blue  B         2-2

CodePudding user response：

Here is a dplyr way.

df1 <- "Id    Date        Group  Diagnosis    
  1     8/16/2004   Red    A
  1     8/16/2004   Red    B
  1     8/16/2004   Red    C

  2     4/23/2010   Blue    A
  2     4/23/2010   Blue    C

  3     5/13/2006   Blue    A
  3     5/13/2006   Blue    B
  3     5/13/2006   Blue    C
  3     6/05/2011   Blue    A
  3     6/05/2011   Blue    B
  3     6/05/2011   Blue    C

  4     10/06/2009   Blue    A
  4     10/06/2009   Blue    B
  4     10/06/2009   Blue    C
  4     7/22/2010    Blue    A
  4     7/22/2010    Blue    B"
df1 <- read.table(textConnection(df1), header = TRUE)

suppressPackageStartupMessages({
  library(dplyr)
})

df1 %>%
  group_by(Id, Group) %>%
  mutate(I2 = duplicated(Diagnosis),
         I1 = any(I2)   1L,
         I2 = I2   1L,
         I = paste(I1, I2, sep = "-")) %>%
  ungroup() %>%
  select(-I1, -I2)
#> # A tibble: 16 × 5
#>       Id Date       Group Diagnosis I    
#>    <int> <chr>      <chr> <chr>     <chr>
#>  1     1 8/16/2004  Red   A         1-1  
#>  2     1 8/16/2004  Red   B         1-1  
#>  3     1 8/16/2004  Red   C         1-1  
#>  4     2 4/23/2010  Blue  A         1-1  
#>  5     2 4/23/2010  Blue  C         1-1  
#>  6     3 5/13/2006  Blue  A         2-1  
#>  7     3 5/13/2006  Blue  B         2-1  
#>  8     3 5/13/2006  Blue  C         2-1  
#>  9     3 6/05/2011  Blue  A         2-2  
#> 10     3 6/05/2011  Blue  B         2-2  
#> 11     3 6/05/2011  Blue  C         2-2  
#> 12     4 10/06/2009 Blue  A         2-1  
#> 13     4 10/06/2009 Blue  B         2-1  
#> 14     4 10/06/2009 Blue  C         2-1  
#> 15     4 7/22/2010  Blue  A         2-2  
#> 16     4 7/22/2010  Blue  B         2-2

^{Created on 2022-11-04 with reprex v2.0.2}