Home > database >  How to use purrr to iterate over reference-dates, filter cases and create a long table
How to use purrr to iterate over reference-dates, filter cases and create a long table

Time:01-02

I have a table of cases. Each row represents a case. A case has a Case_id, a start-date, a complete-date and other attributes. A case is active between its start and complete date. Here's a reprex:

library(tidyverse)
set.seed(42)
Case_id <- seq(1:100)
Start <- sample(seq(as.Date("2022-08-01"), as.Date("2022-12-01"), by = "day"), 100, replace = TRUE)
Complete <- Start   sample(0:60, 1, replace = TRUE)
Other_attributes <- sample(c("Red", "Blue", "Green"), 100, replace = TRUE)

Cases <- tibble(Case_id, Start, Complete, Other_attributes)

head(Cases)
#> # A tibble: 6 × 4
#>   Case_id Start      Complete   Other_attributes
#>     <int> <date>     <date>     <chr>           
#> 1       1 2022-09-18 2022-10-11 Red             
#> 2       2 2022-11-09 2022-12-02 Blue            
#> 3       3 2022-10-04 2022-10-27 Red             
#> 4       4 2022-08-25 2022-09-17 Blue            
#> 5       5 2022-10-13 2022-11-05 Green           
#> 6       6 2022-11-08 2022-12-01 Red

Created on 2023-01-01 with reprex v2.0.2

Ultimately I want to show, how the number of active cases develops over time, possibly grouped by any combination of other attributes. The development over time is represented by weekly reference-dates. In this example I put the reference-dates into a vector, but they could also be organized into a tibble with one column.

Reference_dates <- seq(as.Date("2022-09-04"), as.Date("2022-12-31"), by = "weeks")

To achieve my ultimate objective, I would first create a long table. I iterate over each element of my weekly reference-dates and filter the cases that are active at each reference-date. The filtering of active cases at the first reference-date can be illustrated like this:

Cases %>% filter(Reference_dates[1] >= Start & Reference_dates[1] <= Complete)

The filtered Cases should also be enriched by a new column consisting of the corresponding reference-date.

I repeat this for each reference-date and then row-bind the filtered and enriched cases into a much longer flat table. My problem is how to create this long table.

I want to avoid for loops and use functional programming instead to create the long table. I've looked into list columns with subsequent unnesting and purrr::map() with subsequent purrr::list_rbind(), but I don't know how to start. Any ideas?

Once this problem is solved, I could then easily use rpivottable() or group_by(Reference_dates, Other_attributes) to count the number of active cases per reference-date and also group it by any combination of other attributes.

My true number of cases is actually much larger and I have several other attributes. Please advise if there's a more efficient way to achieve my ultimate objective than creating the long table first.

CodePudding user response:

Using purrr::map_df you could loop over your reference dates, create your filtered datasets and bind them into one df in one step like so:

library(tidyverse)

map_df(Reference_dates, function(x) {
  Cases %>%
    filter(x >= Start & x <= Complete) %>%
    mutate(Reference_Date = x)
})
#> # A tibble: 275 × 5
#>    Case_id Start      Complete   Other_attributes Reference_Date
#>      <int> <date>     <date>     <chr>            <date>        
#>  1       4 2022-08-25 2022-09-17 Blue             2022-09-04    
#>  2       7 2022-08-18 2022-09-10 Red              2022-09-04    
#>  3      11 2022-08-24 2022-09-16 Green            2022-09-04    
#>  4      17 2022-08-20 2022-09-12 Red              2022-09-04    
#>  5      18 2022-08-26 2022-09-18 Red              2022-09-04    
#>  6      24 2022-08-27 2022-09-19 Red              2022-09-04    
#>  7      31 2022-09-03 2022-09-26 Blue             2022-09-04    
#>  8      38 2022-08-24 2022-09-16 Blue             2022-09-04    
#>  9      39 2022-08-30 2022-09-22 Red              2022-09-04    
#> 10      41 2022-08-15 2022-09-07 Red              2022-09-04    
#> # … with 265 more rows

CodePudding user response:

Not a purrr solution, but a dplyr solution.

library(tidyverse)

columnsToAdd <- as.character(Reference_dates)

Cases[,columnsToAdd] <- ""

Cases |> 
  pivot_longer(cols = starts_with('2022')) |> 
  mutate(name = as.Date(name)) |> 
  select(-value) |> 
  filter((name >= Start & name <= Complete))
#> # A tibble: 275 × 5
#>    Case_id Start      Complete   Other_attributes name      
#>      <int> <date>     <date>     <chr>            <date>    
#>  1       1 2022-09-18 2022-10-11 Red              2022-09-18
#>  2       1 2022-09-18 2022-10-11 Red              2022-09-25
#>  3       1 2022-09-18 2022-10-11 Red              2022-10-02
#>  4       1 2022-09-18 2022-10-11 Red              2022-10-09
#>  5       2 2022-11-09 2022-12-02 Blue             2022-11-13
#>  6       2 2022-11-09 2022-12-02 Blue             2022-11-20
#>  7       2 2022-11-09 2022-12-02 Blue             2022-11-27
#>  8       3 2022-10-04 2022-10-27 Red              2022-10-09
#>  9       3 2022-10-04 2022-10-27 Red              2022-10-16
#> 10       3 2022-10-04 2022-10-27 Red              2022-10-23
#> # … with 265 more rows
  • Related