df <- data.frame(date = as.Date(c(rep("2022-01-01", 3),
rep("2022-02-01", 3),
rep("2022-03-01", 4))),
flavor = c("Almond", "Apple", "Apricot",
"Almond", "Maple", "Mint",
"Apricot", "Pecan", "Praline", "Pumpkin"))
#> date flavor
#> 1 2022-01-01 Almond
#> 2 2022-01-01 Apple
#> 3 2022-01-01 Apricot
#> 4 2022-02-01 Almond
#> 5 2022-02-01 Maple
#> 6 2022-02-01 Mint
#> 7 2022-03-01 Apricot
#> 8 2022-03-01 Pecan
#> 9 2022-03-01 Praline
#> 10 2022-03-01 Pumpkin
The R data frame above tracks ice cream flavors at an ice cream shop, month to month. In the month of February there were two flavors added that were not present in the month of January (Maple, Mint), and two flavors removed (Apple, Apricot) that were present in January. In the month of March there were four flavors added that were not present in the month of February (Apricot, Pecan, Praline, Pumpkin), and three flavors removed (Almond, Maple, Mint) that were present in February.
#> date flavors.added flavors.removed
#> 1 2022-01-01 <NA> <NA>
#> 2 2022-02-01 2 2
#> 3 2022-03-01 4 3
How do I write an R script to calculate the summary data frame above? That is to say I want a rolling count of ice cream flavors that were added per month that were not present in the prior month, and also a count of flavors removed per month that were present in the prior month.
CodePudding user response:
Using data.table
:
library(data.table)
df2 = setDT(df)[, .(flavors = list(flavor)), date]
for (i in 2:nrow(df2))
set(
df2, i = i,
j = c('flavors_added', 'flavors_removed'),
list(length(setdiff(df2$flavors[[i]], df2$flavors[[i-1]])), length(setdiff(df2$flavors[[i-1]], df2$flavors[[i]])))
)
df2
# date flavors flavors_added flavors_removed
# <Date> <list> <int> <int>
# 1: 2022-01-01 Almond,Apple,Apricot NA NA
# 2: 2022-02-01 Almond,Maple,Mint 2 2
# 3: 2022-03-01 Apricot,Pecan,Praline,Pumpkin 4 3
CodePudding user response:
In dplyr
:
library(dplyr)
df %>%
group_by(date) %>%
summarise(flavors = list(flavor)) %>%
mutate(flavors.added = lengths(mapply(setdiff, flavors, lag(flavors))),
flavors.removed = lengths(mapply(setdiff, lag(flavors), flavors)))
output
# A tibble: 3 × 4
date flavors flavors.added flavors.removed
<date> <list> <int> <int>
1 2022-01-01 <chr [3]> 3 0
2 2022-02-01 <chr [3]> 2 2
3 2022-03-01 <chr [4]> 4 3