Home > front end >  R Rolling Counts of Additions, or Deletions, to a List
R Rolling Counts of Additions, or Deletions, to a List

Time:05-13

df <- data.frame(date = as.Date(c(rep("2022-01-01", 3), 
                                  rep("2022-02-01", 3),
                                  rep("2022-03-01", 4))),
                 flavor = c("Almond", "Apple", "Apricot", 
                            "Almond", "Maple", "Mint",
                            "Apricot", "Pecan", "Praline", "Pumpkin"))
#>          date  flavor
#> 1  2022-01-01  Almond
#> 2  2022-01-01   Apple
#> 3  2022-01-01 Apricot
#> 4  2022-02-01  Almond
#> 5  2022-02-01   Maple
#> 6  2022-02-01    Mint
#> 7  2022-03-01 Apricot
#> 8  2022-03-01   Pecan
#> 9  2022-03-01 Praline
#> 10 2022-03-01 Pumpkin

The R data frame above tracks ice cream flavors at an ice cream shop, month to month. In the month of February there were two flavors added that were not present in the month of January (Maple, Mint), and two flavors removed (Apple, Apricot) that were present in January. In the month of March there were four flavors added that were not present in the month of February (Apricot, Pecan, Praline, Pumpkin), and three flavors removed (Almond, Maple, Mint) that were present in February.

#>          date  flavors.added  flavors.removed
#> 1  2022-01-01           <NA>             <NA>
#> 2  2022-02-01              2                2
#> 3  2022-03-01              4                3

How do I write an R script to calculate the summary data frame above? That is to say I want a rolling count of ice cream flavors that were added per month that were not present in the prior month, and also a count of flavors removed per month that were present in the prior month.

CodePudding user response:

Using data.table:

library(data.table)
df2 = setDT(df)[, .(flavors = list(flavor)), date]
for (i in 2:nrow(df2))
  set(
    df2, i = i, 
    j = c('flavors_added', 'flavors_removed'), 
    list(length(setdiff(df2$flavors[[i]], df2$flavors[[i-1]])), length(setdiff(df2$flavors[[i-1]], df2$flavors[[i]])))
  )

df2

#          date                       flavors flavors_added flavors_removed
#        <Date>                        <list>         <int>           <int>
# 1: 2022-01-01          Almond,Apple,Apricot            NA              NA
# 2: 2022-02-01             Almond,Maple,Mint             2               2
# 3: 2022-03-01 Apricot,Pecan,Praline,Pumpkin             4               3

CodePudding user response:

In dplyr:

library(dplyr)
df %>% 
  group_by(date) %>% 
  summarise(flavors = list(flavor)) %>% 
  mutate(flavors.added = lengths(mapply(setdiff, flavors, lag(flavors))),
         flavors.removed = lengths(mapply(setdiff, lag(flavors), flavors)))

output

# A tibble: 3 × 4
  date       flavors   flavors.added flavors.removed
  <date>     <list>            <int>           <int>
1 2022-01-01 <chr [3]>             3               0
2 2022-02-01 <chr [3]>             2               2
3 2022-03-01 <chr [4]>             4               3
  • Related