I have a large time-series dataset of patients with different diagnoses. A snapshot of the dataset is below:
time<-rep(1:3, times = 5)
ID<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
Dx1<-c("CBS", "CBS", "CBS", "OtherDx", "OtherDx", "OtherDx", "ACC", "ACC", "ACC", "OtherDx", "OtherDx", "CBS", "OtherDx", "OtherDx", "OtherDx")
Dx2<-c("OtherDx", "OtherDx", "OtherDx", "OtherDx", "OtherDx", "OtherDx", "CBS", "CBS", "CBS", "OtherDx","OtherDx", "OtherDx", "OtherDx","OtherDx", "OtherDx")
df<-tibble(time, ID, Dx1, Dx2)
# A tibble: 15 x 4
ID time Dx1 Dx2
<dbl> <int> <chr> <chr>
1 1 1 CBS OtherDx
2 1 2 CBS OtherDx
3 1 3 CBS OtherDx
4 2 1 OtherDx OtherDx
5 2 2 OtherDx OtherDx
6 2 3 OtherDx OtherDx
7 3 1 ACC CBS
8 3 2 ACC CBS
9 3 3 ACC CBS
10 4 1 OtherDx OtherDx
11 4 2 OtherDx OtherDx
12 4 3 CBS OtherDx
13 5 1 OtherDx OtherDx
14 5 2 OtherDx OtherDx
15 5 3 OtherDx OtherDx
Here, I want to filter and keep only those IDs that have "OtherDx" in both Dx1 and Dx2 for all three time observations. In this snapshot, that would mean keeping only IDs 2 and 5 (not ID 4 as there is a non "OtherDx" value at Time 3).
My dplyr code is:
df2 <- df %>%
group_by(ID, time) %>%
filter(
time== c(1:3) & Dx1== "OtherDx" & Dx2== "OtherDx"
)
But my code doesn't seem to do the job and includes ID 4 as well. What is the best way to extract this data?
CodePudding user response:
You could use if_all()
. This condition if_all(Dx1:Dx2, `==`, "OtherDx")
is equivalent to Dx1 == "OtherDx" & Dx2 == "OtherDx"
, and is more concise if there are more Dx
to be identified.
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(if_all(Dx1:Dx2, `==`, "OtherDx"))) %>%
ungroup()
# A tibble: 6 × 4
ID time Dx1 Dx2
<dbl> <int> <chr> <chr>
1 2 1 OtherDx OtherDx
2 2 2 OtherDx OtherDx
3 2 3 OtherDx OtherDx
4 5 1 OtherDx OtherDx
5 5 2 OtherDx OtherDx
6 5 3 OtherDx OtherDx
CodePudding user response:
You only need to group_by
the ID
column, not the time
column. Then use all()
to test whether all observations contain your target string.
library(dplyr)
df %>%
group_by(ID) %>%
# or all(Dx1 == "OtherDx" & Dx2 == "OtherDx") suggested by @Darren Tsai
filter(all(Dx1 == "OtherDx") & all(Dx2 == "OtherDx")) %>%
ungroup()
# A tibble: 6 × 4
time ID Dx1 Dx2
<int> <dbl> <chr> <chr>
1 1 2 OtherDx OtherDx
2 2 2 OtherDx OtherDx
3 3 2 OtherDx OtherDx
4 1 5 OtherDx OtherDx
5 2 5 OtherDx OtherDx
6 3 5 OtherDx OtherDx
CodePudding user response:
Another option might be to 'pivot_wider()
->filter()
->pivot_longer()
' your dataframe. It's not really necessary for your example, but if you have a larger dataframe and more complicated requirements it might be an option:
library(tidyverse)
time<-rep(1:3, times = 5)
ID<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
Dx1<-c("CBS", "CBS", "CBS", "OtherDx", "OtherDx", "OtherDx", "ACC", "ACC", "ACC", "OtherDx", "OtherDx", "CBS", "OtherDx", "OtherDx", "OtherDx")
Dx2<-c("OtherDx", "OtherDx", "OtherDx", "OtherDx", "OtherDx", "OtherDx", "CBS", "CBS", "CBS", "OtherDx","OtherDx", "OtherDx", "OtherDx","OtherDx", "OtherDx")
df<-tibble(time, ID, Dx1, Dx2)
df2 <- df %>%
pivot_wider(values_from = c(Dx1, Dx2),
names_from = time)
df2
#> # A tibble: 5 × 7
#> ID Dx1_1 Dx1_2 Dx1_3 Dx2_1 Dx2_2 Dx2_3
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 CBS CBS CBS OtherDx OtherDx OtherDx
#> 2 2 OtherDx OtherDx OtherDx OtherDx OtherDx OtherDx
#> 3 3 ACC ACC ACC CBS CBS CBS
#> 4 4 OtherDx OtherDx CBS OtherDx OtherDx OtherDx
#> 5 5 OtherDx OtherDx OtherDx OtherDx OtherDx OtherDx
df3 <- df2 %>%
filter(if_all(starts_with("Dx"), ~.x == "OtherDx"))
df3
#> # A tibble: 2 × 7
#> ID Dx1_1 Dx1_2 Dx1_3 Dx2_1 Dx2_2 Dx2_3
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2 OtherDx OtherDx OtherDx OtherDx OtherDx OtherDx
#> 2 5 OtherDx OtherDx OtherDx OtherDx OtherDx OtherDx
df4 <- df3 %>%
pivot_longer(-ID,
names_to = c(".value", "time"),
names_sep = "_")
df4
#> # A tibble: 6 × 4
#> ID time Dx1 Dx2
#> <dbl> <chr> <chr> <chr>
#> 1 2 1 OtherDx OtherDx
#> 2 2 2 OtherDx OtherDx
#> 3 2 3 OtherDx OtherDx
#> 4 5 1 OtherDx OtherDx
#> 5 5 2 OtherDx OtherDx
#> 6 5 3 OtherDx OtherDx
Created on 2022-08-18 by the reprex package (v2.0.1)