Find set of rows in row-specific range with restriction at different levels-CodePudding

I have a dataset where each row is identified by a hospitalization id - physician id. Each row also contains information on the dates of admission and discharge and the hospital where it took place. A hospitalization may involve multiple physicians. A physician may work in multiple hospitals.

I have another dataset with information on each physician's specialties (e.g., clinician, cardiologist). A physician may have multiple specialties.

I would like to know, for each hospitalization-physician row, the id of all other hospitalizations concluded in the 30 days prior to the start of that hospitalization which were performed in the same hospital by other physicians in the same specialty.

Partly using the solution posted in R (dplyr): find all rows in row-specific range *WITH RESTRICTION*, I managed to write a code that finds all hospitalizations performed by other physicians in the same hospital in the 30-day period prior to the start of the given hospitalization. For each row, I first found the list of all hospitalizations in the given hospital during the 30-day time period. Then, I found a list that only includes the hospitalizations where the ego physician was involved. Finally, I selected the elements in the first list which are not in the second list.

I would like to adjust the code to find the hospitalizations by other physicians who share at least one specialty with the ego physician. Ideally, I would like to change the first step of the code above to find the list of all hospitalizations within the ego physician's set of specialty. Then I can use the rest of the code as it is to subtract from this list the hospitalizations involving the ego physician. The main difficulty here is the fact that a physician may have multiple specialties - otherwise it would just be a matter of including another variable in the filter function.

Below is the code I have now - it does not take into consideration the specialty of the ego physician.

df <- data.frame(hospitalization_id = c(1, 2, 3,
                                        1, 2, 3,
                                        4, 5, 
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A", 
                                 "A", "A", "A", 
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1, 
                                  2, 2, 2,
                                  3, 3, 
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))

df2 <- df %>%
  mutate(
    # Generates 30-day time interval before start of given hospitalization 
    date_range1 = date_start - 30,
    date_range2 = date_start - 1,
    # List of all hospitalizations in given hospital, in time interval
    hospid_all = pmap(list(date_range1, date_range2, hospital_id),
                      function(x, y, z) filter(df,
                                               date_end >= x & date_end <= y,
                                               hospital_id == z)$hospitalization_id),
    hospid_all = lapply(hospid_all, unique),
    # List of ego's hospitalizations in given hospital, in time interval
    hospid_ego = pmap(list(date_range1, date_range2, hospital_id, physician_id),
                      function(x, y, z, p) filter(df,
                                                  date_end >= x & date_end <= y,
                                                  hospital_id == z,
                                                  physician_id == p)$hospitalization_id),
    # List of peers' hospitalizations in given hospital, in time interval
    hospid_peer = future_map2(hospid_all, hospid_ego, ~ .x[!(.x %in% .y)])) %>%
  select(-starts_with('date_'), -hospid_all, -hospid_ego) %>% # only keep peers' list of hospitalization
  rename('ego'='physician_id')

df3 <- df2 %>%
  select(hospitalization_id, hospital_id, ego, hospid_peer) %>%
  unnest(hospid_peer, keep_empty = TRUE)

df4 <- df3 %>%
  left_join(select(df, hospitalization_id, physician_id), 
            by=c('hospid_peer'='hospitalization_id')) %>%
  rename(alter = physician_id)

The specialty of each physician is informed in this other df. In this example, Physician 2 shares specialty with both Physician 1 and Physician 3, but Physician 1 and Physician 3 do not have any specialty in common.

physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                      specialty_code = c(100, 100, 200, 200))

CodePudding user response：

Here is what I have so far, please experiment with other examples and the rest of the data set to see if it is what you are looking for:

df <- data.frame(hospitalization_id = c(1, 2, 3,
                                        1, 2, 3,
                                        4, 5, 
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A", 
                                 "A", "A", "A", 
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1, 
                                  2, 2, 2,
                                  3, 3, 
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))

physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                      specialty_code = c(100, 100, 200, 200)) %>%
  group_by(physician_id) %>%
  summarise(specialties = list(specialty_code))

df_with_date_range <- df %>%
  mutate(date_range1 = date_start - 31,
         date_range2 = date_start - 1) %>%
  as_tibble() %>%
  left_join(physician_spec, by = "physician_id") 

#Below uses specialty

df_with_date_range %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id, physician_id, specialties),
                   function(x, y, z, p, s) filter(df_with_date_range,
                                                 date_start >= x & date_start <= y,
                                                 hospital_id == z,
                                                 physician_id != p,
                                                 any(specialties %in% s))$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

#Below does not use specialty

df_with_date_range %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id, physician_id),
                   function(x, y, z, p) filter(df_with_date_range,
                                                 date_start >= x & date_start <= y,
                                                 hospital_id == z,
                                                 physician_id != p)$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

It appears that for this data set, the specialty included and unincluded versions are identical, so you have to play around with it and see either if I have made a mistake or this is the case. I essentially just joined dataframes and then added another filter condition to the pmap.

CodePudding user response：

You can create two helper functions, other_mds, and f. The first of these takes a physician id, and returns the physician ids of those with matching specialities. The second takes the hospital id, the physician id, and the start date (i.e. for a particular row in df, and return the list of hospitalizations that concluded within the prior 30 days, were in the same hospital, and were conducted by a physician with a matching speciality.

other_mds <- function(pid) {
  physician_spec[
    physician_id!=pid & specialty_code %in% physician_spec[physician_id==pid, specialty_code],
    physician_id]
}

f <- function(hid, pid, s) {
  other_phys = other_mds(pid)
  df[hospital_id == hid & physician_id %in% other_phys & s>date_end & (s-date_end)<30,paste0(hospitalization_id, collapse=",")]
}

Now, we just apply the function f to each row

library(data.table)
setDT(df)
setDT(physician_spec)
df[, matches:=f(hospital_id, physician_id,date_start), 1:nrow(df)]

Output:

    hospitalization_id hospital_id physician_id date_start   date_end matches
                 <num>      <char>        <num>     <Date>     <Date>  <char>
 1:                  1           A            1 2000-01-01 2000-01-03        
 2:                  2           A            1 2000-01-12 2000-01-18       1
 3:                  3           A            1 2000-01-20 2000-01-22     1,2
 4:                  1           A            2 2000-01-01 2000-01-03        
 5:                  2           A            2 2000-01-12 2000-01-18       1
 6:                  3           A            2 2000-01-20 2000-01-22   1,2,4
 7:                  4           A            3 2000-01-12 2000-01-18       1
 8:                  5           A            3 2000-01-20 2000-01-22     1,2
 9:                  6           B            2 2000-02-10 2000-02-11        
10:                  7           B            2 2000-02-11 2000-02-14        
11:                  8           B            2 2000-02-12 2000-02-17