Filter and extract rows based on multiple conditions-CodePudding

I have a large time-series dataset of patients with different diagnoses. A snapshot of the dataset is below:

time<-rep(1:3, times = 5)
ID<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
Dx1<-c("CBS", "CBS", "CBS", "OtherDx", "OtherDx", "OtherDx", "ACC", "ACC", "ACC", "OtherDx", "OtherDx", "CBS", "OtherDx", "OtherDx", "OtherDx")
Dx2<-c("OtherDx", "OtherDx", "OtherDx", "OtherDx", "OtherDx", "OtherDx", "CBS", "CBS", "CBS", "OtherDx","OtherDx", "OtherDx", "OtherDx","OtherDx", "OtherDx")
df<-tibble(time, ID, Dx1, Dx2)

 # A tibble: 15 x 4
      ID  time Dx1     Dx2    
   <dbl> <int> <chr>   <chr>  
 1     1     1 CBS     OtherDx
 2     1     2 CBS     OtherDx
 3     1     3 CBS     OtherDx
 4     2     1 OtherDx OtherDx
 5     2     2 OtherDx OtherDx
 6     2     3 OtherDx OtherDx
 7     3     1 ACC     CBS    
 8     3     2 ACC     CBS    
 9     3     3 ACC     CBS    
10     4     1 OtherDx OtherDx
11     4     2 OtherDx OtherDx
12     4     3 CBS     OtherDx
13     5     1 OtherDx OtherDx
14     5     2 OtherDx OtherDx
15     5     3 OtherDx OtherDx

Here, I want to filter and keep only those IDs that have "OtherDx" in both Dx1 and Dx2 for all three time observations. In this snapshot, that would mean keeping only IDs 2 and 5 (not ID 4 as there is a non "OtherDx" value at Time 3).

My dplyr code is:

df2 <- df %>%
  group_by(ID, time) %>%
  filter(
    time== c(1:3) & Dx1== "OtherDx" & Dx2== "OtherDx"
  )

But my code doesn't seem to do the job and includes ID 4 as well. What is the best way to extract this data?

CodePudding user response：

You could use if_all(). This condition if_all(Dx1:Dx2, `==`, "OtherDx") is equivalent to Dx1 == "OtherDx" & Dx2 == "OtherDx", and is more concise if there are more Dx to be identified.

library(dplyr)

df %>%
  group_by(ID) %>% 
  filter(all(if_all(Dx1:Dx2, `==`, "OtherDx"))) %>%
  ungroup()

# A tibble: 6 × 4
     ID  time Dx1     Dx2
  <dbl> <int> <chr>   <chr>
1     2     1 OtherDx OtherDx
2     2     2 OtherDx OtherDx
3     2     3 OtherDx OtherDx
4     5     1 OtherDx OtherDx
5     5     2 OtherDx OtherDx
6     5     3 OtherDx OtherDx

CodePudding user response：

You only need to group_by the ID column, not the time column. Then use all() to test whether all observations contain your target string.

library(dplyr)

df %>% 
  group_by(ID) %>% 
  # or all(Dx1 == "OtherDx" & Dx2 == "OtherDx") suggested by @Darren Tsai
  filter(all(Dx1 == "OtherDx") & all(Dx2 == "OtherDx")) %>% 
  ungroup()

# A tibble: 6 × 4
   time    ID Dx1     Dx2    
  <int> <dbl> <chr>   <chr>  
1     1     2 OtherDx OtherDx
2     2     2 OtherDx OtherDx
3     3     2 OtherDx OtherDx
4     1     5 OtherDx OtherDx
5     2     5 OtherDx OtherDx
6     3     5 OtherDx OtherDx

CodePudding user response：

Another option might be to 'pivot_wider()->filter()->pivot_longer()' your dataframe. It's not really necessary for your example, but if you have a larger dataframe and more complicated requirements it might be an option:

library(tidyverse)

time<-rep(1:3, times = 5)
ID<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
Dx1<-c("CBS", "CBS", "CBS", "OtherDx", "OtherDx", "OtherDx", "ACC", "ACC", "ACC", "OtherDx", "OtherDx", "CBS", "OtherDx", "OtherDx", "OtherDx")
Dx2<-c("OtherDx", "OtherDx", "OtherDx", "OtherDx", "OtherDx", "OtherDx", "CBS", "CBS", "CBS", "OtherDx","OtherDx", "OtherDx", "OtherDx","OtherDx", "OtherDx")
df<-tibble(time, ID, Dx1, Dx2)

df2 <- df %>%
  pivot_wider(values_from = c(Dx1, Dx2),
              names_from = time)
df2
#> # A tibble: 5 × 7
#>      ID Dx1_1   Dx1_2   Dx1_3   Dx2_1   Dx2_2   Dx2_3  
#>   <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1     1 CBS     CBS     CBS     OtherDx OtherDx OtherDx
#> 2     2 OtherDx OtherDx OtherDx OtherDx OtherDx OtherDx
#> 3     3 ACC     ACC     ACC     CBS     CBS     CBS    
#> 4     4 OtherDx OtherDx CBS     OtherDx OtherDx OtherDx
#> 5     5 OtherDx OtherDx OtherDx OtherDx OtherDx OtherDx

df3 <- df2 %>%
  filter(if_all(starts_with("Dx"), ~.x == "OtherDx"))
df3
#> # A tibble: 2 × 7
#>      ID Dx1_1   Dx1_2   Dx1_3   Dx2_1   Dx2_2   Dx2_3  
#>   <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1     2 OtherDx OtherDx OtherDx OtherDx OtherDx OtherDx
#> 2     5 OtherDx OtherDx OtherDx OtherDx OtherDx OtherDx
  
df4 <- df3 %>%
  pivot_longer(-ID,
               names_to = c(".value", "time"),
               names_sep = "_")
df4
#> # A tibble: 6 × 4
#>      ID time  Dx1     Dx2    
#>   <dbl> <chr> <chr>   <chr>  
#> 1     2 1     OtherDx OtherDx
#> 2     2 2     OtherDx OtherDx
#> 3     2 3     OtherDx OtherDx
#> 4     5 1     OtherDx OtherDx
#> 5     5 2     OtherDx OtherDx
#> 6     5 3     OtherDx OtherDx

^{Created on 2022-08-18 by the reprex package (v2.0.1)}