I have a dataset with bunch of repeated observation like this.
Id Date Group Diagnosis
1 8/16/2004 Red A
1 8/16/2004 Red B
1 8/16/2004 Red C
2 4/23/2010 Blue A
2 4/23/2010 Blue C
3 5/13/2006 Blue A
3 5/13/2006 Blue B
3 5/13/2006 Blue C
3 6/05/2011 Blue A
3 6/05/2011 Blue B
3 6/05/2011 Blue C
4 10/06/2009 Blue A
4 10/06/2009 Blue B
4 10/06/2009 Blue C
4 7/22/2010 Blue A
4 7/22/2010 Blue B
I like to create an new indicator value that tracks which observation have only one set of observations and which have two . Among those who have two sets of observations the indcator should indicate which is the 1st and 2nd so on.
Expected output
Id Date Group Diagnosis I
1 8/16/2004 Red A 1-1
1 8/16/2004 Red B 1-1
1 8/16/2004 Red C 1-1
2 4/23/2010 Blue A 1-1
2 4/23/2010 Blue C 1-1
3 5/13/2006 Blue A 2-1
3 5/13/2006 Blue B 2-1
3 5/13/2006 Blue C 2-1
3 6/05/2011 Blue A 2-2
3 6/05/2011 Blue B 2-2
3 6/05/2011 Blue C 2-2
4 10/06/2009 Blue A 2-1
4 10/06/2009 Blue B 2-1
4 10/06/2009 Blue C 2-1
4 7/22/2010 Blue A 2-2
4 7/22/2010 Blue B 2-2
Values in column I is 1-1 for Id 1 and Id 2 because both these Ids have only once set of Diagnosis, Id 1 evaluated on 8/16/2004 and Id 2 evaluated on 4/23/2010
Values in column I is 2-1 for Id 3 and Id 4 because both these Ids have two sets of Diagnosis, Id 3 evaluated on 5/13/2006 and 6/05/2011, Id 4 evaluated on 10/06/2009 and 7/22/2010. The first set of observations for Id 3 and Id 4 were on 5/13/2006 and 10/06/2009 so 2-1. The second set of observations for Id 3 and Id 4 were on 6/05/2011 and 7/22/2010 so 2-2
I tried n()
and n_distinct
using group_by( Id,Group, Diagnosis)
but this not working. So any suggestions or help is much appreciated. Thanks.
CodePudding user response:
We can do:
library(dplyr)
df |>
group_by(Id) |>
mutate(I = paste0(n_distinct(Date), "-", match(Date, unique(Date)))) |>
ungroup()
Using paste0
to paste the number of distinct dates in the group and match
to match the date with the first match
Id Date Group Diagnosis I
<int> <chr> <chr> <chr> <chr>
1 1 8/16/2004 Red A 1-1
2 1 8/16/2004 Red B 1-1
3 1 8/16/2004 Red C 1-1
4 2 4/23/2010 Blue A 1-1
5 2 4/23/2010 Blue C 1-1
6 3 5/13/2006 Blue A 2-1
7 3 5/13/2006 Blue B 2-1
8 3 5/13/2006 Blue C 2-1
9 3 6/05/2011 Blue A 2-2
10 3 6/05/2011 Blue B 2-2
11 3 6/05/2011 Blue C 2-2
12 4 10/06/2009 Blue A 2-1
13 4 10/06/2009 Blue B 2-1
14 4 10/06/2009 Blue C 2-1
15 4 7/22/2010 Blue A 2-2
16 4 7/22/2010 Blue B 2-2
CodePudding user response:
Here is a dplyr
way.
df1 <- "Id Date Group Diagnosis
1 8/16/2004 Red A
1 8/16/2004 Red B
1 8/16/2004 Red C
2 4/23/2010 Blue A
2 4/23/2010 Blue C
3 5/13/2006 Blue A
3 5/13/2006 Blue B
3 5/13/2006 Blue C
3 6/05/2011 Blue A
3 6/05/2011 Blue B
3 6/05/2011 Blue C
4 10/06/2009 Blue A
4 10/06/2009 Blue B
4 10/06/2009 Blue C
4 7/22/2010 Blue A
4 7/22/2010 Blue B"
df1 <- read.table(textConnection(df1), header = TRUE)
suppressPackageStartupMessages({
library(dplyr)
})
df1 %>%
group_by(Id, Group) %>%
mutate(I2 = duplicated(Diagnosis),
I1 = any(I2) 1L,
I2 = I2 1L,
I = paste(I1, I2, sep = "-")) %>%
ungroup() %>%
select(-I1, -I2)
#> # A tibble: 16 × 5
#> Id Date Group Diagnosis I
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 8/16/2004 Red A 1-1
#> 2 1 8/16/2004 Red B 1-1
#> 3 1 8/16/2004 Red C 1-1
#> 4 2 4/23/2010 Blue A 1-1
#> 5 2 4/23/2010 Blue C 1-1
#> 6 3 5/13/2006 Blue A 2-1
#> 7 3 5/13/2006 Blue B 2-1
#> 8 3 5/13/2006 Blue C 2-1
#> 9 3 6/05/2011 Blue A 2-2
#> 10 3 6/05/2011 Blue B 2-2
#> 11 3 6/05/2011 Blue C 2-2
#> 12 4 10/06/2009 Blue A 2-1
#> 13 4 10/06/2009 Blue B 2-1
#> 14 4 10/06/2009 Blue C 2-1
#> 15 4 7/22/2010 Blue A 2-2
#> 16 4 7/22/2010 Blue B 2-2
Created on 2022-11-04 with reprex v2.0.2