How to access group ids for grouped filter operation?


I have a data frame containing dates and associated values for three groups as follows.

library(lubridate)
library(magrittr)
library(dplyr)

data <-
  data.frame(group = rep(c("a", "b", "c"), each = 5),
             date = rep(seq(ymd(20200101), ymd(20200105), by = 1),
                        times = 3),
             value = runif(15))

I would like to perform a grouped filter operation in order to extract subsets of the data by date, where the start date varies by group.

start_date <- list(a = ymd(20200102), b = ymd(20200104), c = ymd(20200103))

I'd like to use the group name to index into the start date list. I tried the following, but I get an error message.

data %>%
  group_by(group) %>%
  filter(date >= start_date[[group]]) 
Error in app$vspace(new_style$`margin-top` %||% 0) :
  attempt to apply non-function

What am I doing wrong?

(The alternative to the above procedure is to set the start dates up as a data frame, join it to data, and perform an ungrouped filter. This works as intended, but it's less elegant in my opinion, and I'd prefer to keep the start dates as a list for other reasons.)

start_date_2 <- 
  data.frame(group = c("a", "b", "c"), 
             start_date = c(ymd(20200102), ymd(20200104), ymd(20200103)))

data %>%
  left_join(start_date_2, by = "group") %>%
  filter(date >= start_date) %>%
  select(-start_date)
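
As an aside, the lookup frame could also be derived from the list itself rather than typed out again, so the dates only live in one place. A small sketch, using do.call(c, ...) to combine the dates without dropping the Date class:

start_date_2 <- data.frame(group = names(start_date),
                           start_date = do.call(c, start_date))  # same lookup as above, built from the list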

CodePudding user response:

We can use purrr: split the data by group, then filter each piece against its start date with map2().

library(lubridate)
library(magrittr)
library(dplyr)
library(purrr)

data %>% 
  group_split(group) %>%                           # one data frame per group, in order of the group levels
  map2(start_date, ~ filter(.x, date >= .y)) %>%   # pair each group with its start date by position
  bind_rows()

#> # A tibble: 9 x 3
#>   group date        value
#>   <fct> <date>      <dbl>
#> 1 a     2020-01-02 0.0756
#> 2 a     2020-01-03 0.0535
#> 3 a     2020-01-04 0.338 
#> 4 a     2020-01-05 0.332 
#> 5 b     2020-01-04 0.992 
#> 6 b     2020-01-05 0.356 
#> 7 c     2020-01-03 0.751 
#> 8 c     2020-01-04 0.949 
#> 9 c     2020-01-05 0.778
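
The map2() call pairs the split data frames with the start dates by position, so it relies on group_split() returning the groups in the same order as the start_date list (alphabetical here). A variant that matches by name instead, as a sketch rather than part of the original answer:

data %>%
  split(.$group) %>%                                  # named list, one data frame per group
  imap(~ filter(.x, date >= start_date[[.y]])) %>%    # .y is the group name
  bind_rows()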

CodePudding user response:

We could use a non-equi join with data.table:

library(data.table)
setDT(data)[data.table(date = do.call(c, start_date),
                       group = names(start_date)),
            on = .(group, date >= date)]   # keep rows on or after the group's start date
    group       date       value
   <char>     <Date>       <num>
1:      a 2020-01-02 0.003896343
2:      a 2020-01-02 0.052198616
3:      a 2020-01-02 0.866560180
4:      a 2020-01-02 0.576245169
5:      b 2020-01-04 0.531409336
6:      b 2020-01-04 0.383936672
7:      c 2020-01-03 0.041919498
8:      c 2020-01-03 0.363742695
9:      c 2020-01-03 0.856596967
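
Note that a non-equi join reports the join column from the lookup table, which is why the date column above shows each group's start date rather than the original dates. If the original dates should be kept, they can be selected with the x. prefix in j; a sketch, not part of the original answer:

lookup <- data.table(start = do.call(c, start_date), group = names(start_date))
setDT(data)[lookup, on = .(group, date >= start),
            .(group, date = x.date, value)]   # x.date refers to data's own date column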

CodePudding user response:

We can access the name of the current group by using cur_group(). It returns a one-row tibble with one column per grouping variable, holding the values that define the current group, so we subset it with $group to get the group's name.

library(dplyr)

data %>%
  group_by(group) %>%
  filter(date >= start_date[[cur_group()$group]])

#> # A tibble: 9 x 3
#> # Groups:   group [3]
#>   group date        value
#>   <chr> <date>      <dbl>
#> 1 a     2020-01-02 0.225 
#> 2 a     2020-01-03 0.345 
#> 3 a     2020-01-04 0.110 
#> 4 a     2020-01-05 0.0951
#> 5 b     2020-01-04 0.356 
#> 6 b     2020-01-05 0.345 
#> 7 c     2020-01-03 0.0973
#> 8 c     2020-01-04 0.344 
#> 9 c     2020-01-05 0.418

Created on 2022-03-29 by the reprex package (v0.3.0)
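
To see the lookup in isolation: because cur_group()$group is the current group's name, it can index the list directly, for example to list each group's start date (a sketch, not part of the original answer):

data %>%
  group_by(group) %>%
  summarise(start = start_date[[cur_group()$group]], .groups = "drop")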

CodePudding user response:

Another option could be to match each row's group against the names of the start date list and filter on the resulting per-row start dates:

data %>%
  filter(date >= start_date[match(group, names(start_date))])

  group       date      value
1     a 2020-01-02 0.29910439
2     a 2020-01-03 0.08603731
3     a 2020-01-04 0.78913465
4     a 2020-01-05 0.40025848
5     b 2020-01-04 0.87568202
6     b 2020-01-05 0.21065155
7     c 2020-01-03 0.47403618
8     c 2020-01-04 0.74996630
9     c 2020-01-05 0.08386366
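
The same match()-based lookup also works outside dplyr, since it simply builds a per-row vector of start dates to compare against; a base R sketch, not from the original thread:

starts <- do.call(c, start_date)   # named Date vector: a, b, c
data[data$date >= starts[match(data$group, names(starts))], ]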