Keep rows that are within a specific interval for different conditions, by group

Here's a reprex for illustration.

library(tidyverse)

set.seed(1337)
df <- tibble(
  date_visit = sample(seq(as.Date("2020/01/01"),
    as.Date("2021/01/01"),
    by = "day"
  ), 400, replace = T),
  patient_id = as.factor(paste("patient", sample(seq(1, 13), 400, replace = T), sep = "_")),
  type_of_visit = as.factor(sample(c("medical", "veterinary"), 400, replace = T))
)

What I'm trying to do is create a data frame that keeps the patient_id (group by, I assume) and the visit types if that patient has had 2 different visits in less than 24 hours. Alternatively, I could add a variable that is TRUE/FALSE depending on whether that condition is met.

I tried a left join by patient_id to work with 2 different variables, but that takes too much computing time (my original data frame is much longer than this).

Can someone point me in the right direction?

Thank you

CodePudding user response：

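Maybe this will help. For each patient and date, check whether both visit types occur, then collapse to one flag per patient:

library(dplyr)

df %>%
  group_by(patient_id, date_visit) %>%
  # TRUE when the patient had at least 2 different visit types on that date
  summarise(flag = n_distinct(type_of_visit) >= 2, .groups = "drop_last") %>%
  # still grouped by patient_id here: TRUE if any date qualified
  summarise(flag = any(flag))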

#  patient_id flag 
#   <fct>      <lgl>
# 1 patient_1  TRUE 
# 2 patient_10 FALSE
# 3 patient_11 TRUE 
# 4 patient_12 FALSE
# 5 patient_13 FALSE
# 6 patient_2  FALSE
# 7 patient_3  FALSE
# 8 patient_4  FALSE
# 9 patient_5  TRUE 
#10 patient_6  FALSE
#11 patient_7  TRUE 
#12 patient_8  TRUE 
#13 patient_9  TRUE

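If you want to keep all the original rows for those patient IDs, use mutate() instead of summarise() so the rows are not collapsed:

df %>%
  group_by(patient_id, date_visit) %>%
  mutate(flag = n_distinct(type_of_visit) >= 2) %>%
  # regroup by patient so any() looks across all of that patient's dates
  group_by(patient_id) %>%
  filter(any(flag)) %>%
  ungroup()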
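
Grouping by date_visit treats "less than 24 hours" as "same calendar date", which is all a Date column allows. If the real data carries timestamps, a true 24-hour check might look roughly like the sketch below. The `datetime_visit` column and the `visits` data are hypothetical and not part of the reprex in the question.

```r
library(dplyr)

# Hypothetical data: `datetime_visit` (POSIXct) is an assumption;
# the reprex in the question only has a Date column.
visits <- tibble(
  patient_id = c("patient_1", "patient_1", "patient_2", "patient_2"),
  datetime_visit = as.POSIXct(
    c("2020-03-01 22:00:00", "2020-03-02 09:00:00",
      "2020-03-01 08:00:00", "2020-03-03 08:00:00"),
    tz = "UTC"
  ),
  type_of_visit = c("medical", "veterinary", "medical", "veterinary")
)

flagged <- visits %>%
  arrange(patient_id, datetime_visit) %>%
  group_by(patient_id) %>%
  mutate(
    # hours since the patient's previous visit (NA for the first visit)
    gap_hours = as.numeric(
      difftime(datetime_visit, lag(datetime_visit), units = "hours")
    ),
    # TRUE when the previous visit was of another type and under 24 h earlier
    close_pair = !is.na(gap_hours) & gap_hours < 24 &
      type_of_visit != lag(type_of_visit),
    # patient-level flag, repeated on every row of that patient
    flag = any(close_pair)
  ) %>%
  ungroup()
```

Comparing each visit only with the previous one should be enough: if two visits of different types fall within 24 hours of each other, some consecutive pair between them must also switch type inside that window. In this made-up data, patient_1 is flagged even though the two visits fall on different dates, which a same-date grouping would miss.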

CodePudding user response：

library(tidyverse)

set.seed(1337)
df <- tibble(
  date_visit = sample(seq(as.Date("2020/01/01"),
    as.Date("2021/01/01"),
    by = "day"
  ), 400, replace = T),
  patient_id = as.factor(paste("patient", sample(seq(1, 13), 400, replace = T), sep = "_")),
  type_of_visit = as.factor(sample(c("medical", "veterinary"), 400, replace = T))
)
df
#> # A tibble: 400 x 3
#>    date_visit patient_id type_of_visit
#>    <date>     <fct>      <fct>        
#>  1 2020-05-26 patient_11 medical      
#>  2 2020-08-29 patient_4  medical      
#>  3 2020-02-18 patient_6  medical      
#>  4 2020-07-28 patient_9  veterinary   
#>  5 2020-05-31 patient_9  veterinary   
#>  6 2020-07-29 patient_1  veterinary   
#>  7 2020-12-21 patient_11 veterinary   
#>  8 2020-07-06 patient_9  veterinary   
#>  9 2020-04-10 patient_3  medical      
#> 10 2020-11-08 patient_12 medical      
#> # … with 390 more rows

df %>%
  group_by(patient_id, date_visit) %>%
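  # the data only has dates, so "less than 24h" is approximated by "same date";
  # n() == 2 keeps dates with exactly two visits, whatever their type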
  filter(n() == 2) %>%
  ungroup() %>%
  distinct(patient_id, type_of_visit)
#> # A tibble: 15 x 2
#>    patient_id type_of_visit
#>    <fct>      <fct>        
#>  1 patient_9  veterinary   
#>  2 patient_2  veterinary   
#>  3 patient_11 medical      
#>  4 patient_12 veterinary   
#>  5 patient_2  medical      
#>  6 patient_3  veterinary   
#>  7 patient_5  veterinary   
#>  8 patient_7  veterinary   
#>  9 patient_6  veterinary   
#> 10 patient_11 veterinary   
#> 11 patient_9  medical      
#> 12 patient_10 veterinary   
#> 13 patient_5  medical      
#> 14 patient_1  veterinary   
#> 15 patient_3  medical

Created on 2021-10-07 by the reprex package (v2.0.1)
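
Note that filter(n() == 2) keeps dates with exactly two visits of any type, which is why patient_10 shows up above with only veterinary visits while being FALSE in the other answer. If the intent is two different visit types on the same date, one possible variant is sketched below on a small made-up data set (the `visits` data is hypothetical, not the reprex data).

```r
library(dplyr)

# Small made-up example, not the reprex data
visits <- tibble(
  date_visit = as.Date(c("2020-01-01", "2020-01-01",
                         "2020-01-02", "2020-01-02", "2020-01-05")),
  patient_id = c("patient_1", "patient_1",
                 "patient_2", "patient_2", "patient_2"),
  type_of_visit = c("medical", "veterinary",
                    "medical", "medical", "veterinary")
)

both_types <- visits %>%
  group_by(patient_id, date_visit) %>%
  # keep dates on which the patient had at least two different visit types
  filter(n_distinct(type_of_visit) >= 2) %>%
  ungroup() %>%
  distinct(patient_id, type_of_visit)
```

Here patient_2 has two medical visits on the same date, so n() == 2 would keep that patient while the n_distinct() condition does not.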