Home > Software design >  R (dplyr): find all rows in row-specific range *WITH RESTRICTION*
R (dplyr): find all rows in row-specific range *WITH RESTRICTION*

Time:08-16

I have a dataset where each row is identified by a hospitalization id. Each row contains information on the hospitalization id, the dates of admission and discharge, as well as the identification of the hospital where it took place and the physician responsible for it.

I would like to know, for each hospitalization, the id of all other hospitalizations concluded in the 30 days before the beginning of the given hospitalization that were performed by other physicians in the same hospital.

Below is a simple example of 8 hospitalizations performed by 2 physicians in 2 hospitals (physicians may work in more than one hospital).

library("tidyverse")
df <- data.frame(hospitalization_id = c(1, 2, 3, 
                                        4, 5, 
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A", 
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1, 
                                  2, 2, 
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))

Using the solution posted in R (dplyr): find all rows in row-specific range: first, I find all other hospitalizations during the 30-day period prior to the start of the given hospitalization in the given hospital; then, I drop those hospitalizations that were performed by the same physician.

df_with_date_range <- df %>%
  mutate(date_range1 = date_start - 31,
         date_range2 = date_start - 1)

df_semifinal <- df_with_date_range %>%
  rowwise() %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id),
                                     function(x, y, z) ungroup(filter(rowwise(df_with_date_range),
                                                                      between(date_end, x, y),
                                                                      hospital_id == z))$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

df_final <- df_semifinal %>% 
  left_join(select(df, hospitalization_id, physician_id),
            by = c('hospital_id_in_range' = 'hospitalization_id')) %>%
  mutate(hospital_id_in_range = ifelse(physician_id.x == physician_id.y, NA, hospital_id_in_range)) %>%
  select(-physician_id.y) %>%
  rename(physician_id = physician_id.x) %>%
  distinct()

I am trying to write a more efficient code given that my data is massive - ideally I would like to avoid just adding all hospitalizations and then dropping the ones performed by the given physician.

CodePudding user response:

What about just adding another filtering criteria for physician in my previous solution as such:

df_with_date_range %>%
  rowwise() %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id, physician_id),
                   function(x, y, z, p) filter(df_with_date_range,
                                                 date_start >= x & date_start <= y,
                                                 hospital_id == z,
                                                 physician_id != p)$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

# # A tibble: 9 × 8
#   hospitalization_id hospital_id physician_id date_start date_end   date_range1 date_range2 hospital_id_in_range
#                <dbl> <chr>              <dbl> <date>     <date>     <date>      <date>                     <dbl>
# 1                  1 A                      1 2000-01-01 2000-01-03 1999-12-01  1999-12-31                    NA
# 2                  2 A                      1 2000-01-12 2000-01-18 1999-12-12  2000-01-11                    NA
# 3                  3 A                      1 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     4
# 4                  4 A                      2 2000-01-12 2000-01-18 1999-12-12  2000-01-11                     1
# 5                  5 A                      2 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     1
# 6                  5 A                      2 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     2
# 7                  6 B                      2 2000-02-10 2000-02-11 2000-01-10  2000-02-09                    NA
# 8                  7 B                      2 2000-02-11 2000-02-14 2000-01-11  2000-02-10                    NA
# 9                  8 B                      2 2000-02-12 2000-02-17 2000-01-12  2000-02-11                    NA

Also, I switched the between to a better vectorized implementation that does not require a second rowwise. Using more rowwise tends to slow the procedure down. This one should be a bit faster.

The expected output might be a bit different than yours because I filter during the procedure instead of replacing with NAs afterwards. If you want to replace physician copies with NAs, let me know and I can code a mutate of the filtered data frame to do that.

  • Related