Home > Software engineering >  R Compare Lists and Identify Individual Changes
R Compare Lists and Identify Individual Changes

Time:06-05

Here's an ice cream shop data frame in R, that shows the flavors at a store month to month.

df <- data.frame(date = as.Date(c(rep("2022-01-01", 3), 
                                  rep("2022-02-01", 3),
                                  rep("2022-03-01", 4))),
                 flavor = c("Almond", "Apple", "Apricot", 
                            "Almond", "Maple", "Mint",
                            "Almond", "Maple", "Mint", "Pumpkin"))

#>         date  flavor
#> 1  2022-01-01  Almond
#> 2  2022-01-01   Apple
#> 3  2022-01-01 Apricot
#> 4  2022-02-01  Almond
#> 5  2022-02-01   Maple
#> 6  2022-02-01    Mint
#> 7  2022-03-01  Almond
#> 8  2022-03-01   Maple
#> 9  2022-03-01    Mint
#> 10 2022-03-01 Pumpkin

I created a script to show the ice cream flavors added for any particular month. You can also see flavors removed, as a list. Note that there were not any flavors removed in March (2022-03-01, flavors.removed equals <chr [0]>).

library(dplyr)
library(tidyr)
df %>% 
  group_by(date) %>% 
  summarize(flavors = list(flavor)) %>% 
  mutate(flavors.added = mapply(setdiff, flavors, lag(flavors)),
         flavors.removed = mapply(setdiff, lag(flavors), flavors)) %>% 
  ungroup %>% 
  select(-flavors) %>% 
  unnest_longer(flavors.added)

#> # A tibble: 6 x 3
#>   date       flavors.added flavors.removed
#>   <date>     <chr>         <list>         
#> 1 2022-01-01 Almond        <NULL>         
#> 2 2022-01-01 Apple         <NULL>         
#> 3 2022-01-01 Apricot       <NULL>         
#> 4 2022-02-01 Maple         <chr [2]>      
#> 5 2022-02-01 Mint          <chr [2]>      
#> 6 2022-03-01 Pumpkin       <chr [0]>  

When I attempt to capture information on the flavors removed via a call to unnest_longer(flavors.removed) I end up inadvertently filtering out all information for 2022-03-01 because the flavors.removed list is empty (<chr [0]>) for this 2022-03-01 time period.

library(dplyr)
library(tidyr)
df %>% 
  group_by(date) %>% 
  summarize(flavors = list(flavor)) %>% 
  mutate(flavors.added = mapply(setdiff, flavors, lag(flavors)),
         flavors.removed = mapply(setdiff, lag(flavors), flavors)) %>% 
  ungroup %>% 
  select(-flavors) %>% 
  unnest_longer(flavors.added) %>% 
  unnest_longer(flavors.removed) %>% 
  pivot_longer(-date, names_to = "type", values_to = "flavor") %>% 
  arrange(date, type) %>% 
  unique()

#> # A tibble: 8 x 3
#>   date       type            flavor 
#>   <date>     <chr>           <chr>  
#> 1 2022-01-01 flavors.added   Almond 
#> 2 2022-01-01 flavors.added   Apple  
#> 3 2022-01-01 flavors.added   Apricot
#> 4 2022-01-01 flavors.removed NA     
#> 5 2022-02-01 flavors.added   Maple  
#> 6 2022-02-01 flavors.added   Mint   
#> 7 2022-02-01 flavors.removed Apple  
#> 8 2022-02-01 flavors.removed Apricot

Is there a better approach to individually identify flavors added and removed, month to month? I need to regain the ninth row as shown below, that gets filtered out utilizing my flawed approach.

#> # A tibble: 9 x 3
#>   date       type            flavor 
#>   <date>     <chr>           <chr>  
#> 1 2022-01-01 flavors.added   Almond 
#> 2 2022-01-01 flavors.added   Apple  
#> 3 2022-01-01 flavors.added   Apricot
#> 4 2022-01-01 flavors.removed NA     
#> 5 2022-02-01 flavors.added   Maple  
#> 6 2022-02-01 flavors.added   Mint   
#> 7 2022-02-01 flavors.removed Apple  
#> 8 2022-02-01 flavors.removed Apricot
#> 9 2022-03-01 flavors.added   Pumpkin

CodePudding user response:

A possible solution if you don't need the NA on row 4:

df %>% 
  group_by(date) %>% 
  summarize(flavors = list(flavor)) %>% 
  mutate(flavors.added = mapply(setdiff, flavors, lag(flavors)),
         flavors.removed = mapply(setdiff, lag(flavors), flavors)) %>% 
  ungroup %>% 
  select(-flavors) %>% 
  pivot_longer(-date, names_to = "type", values_to = "flavor") %>% 
  unnest(flavor)

# A tibble: 8 × 3
  date       type            flavor 
  <date>     <chr>           <chr>  
1 2022-01-01 flavors.added   Almond 
2 2022-01-01 flavors.added   Apple  
3 2022-01-01 flavors.added   Apricot
4 2022-02-01 flavors.added   Maple  
5 2022-02-01 flavors.added   Mint   
6 2022-02-01 flavors.removed Apple  
7 2022-02-01 flavors.removed Apricot
8 2022-03-01 flavors.added   Pumpkin

CodePudding user response:

I find it more straightforward to find the added and removed flavors separately, and then bind them together at the end if you need to.

You can use tidyr::unchop(keep_empty = TRUE) to keep from dropping the empty rows in this case.

library(tidyverse)

df <- tibble(
  date = as.Date(c(
    rep("2022-01-01", 3), 
    rep("2022-02-01", 3),
    rep("2022-03-01", 4)
  )),
  flavor = c(
    "Almond", "Apple", "Apricot", 
    "Almond", "Maple", "Mint",
    "Almond", "Maple", "Mint", "Pumpkin"
  )
)

df
#> # A tibble: 10 × 2
#>    date       flavor 
#>    <date>     <chr>  
#>  1 2022-01-01 Almond 
#>  2 2022-01-01 Apple  
#>  3 2022-01-01 Apricot
#>  4 2022-02-01 Almond 
#>  5 2022-02-01 Maple  
#>  6 2022-02-01 Mint   
#>  7 2022-03-01 Almond 
#>  8 2022-03-01 Maple  
#>  9 2022-03-01 Mint   
#> 10 2022-03-01 Pumpkin

flavors <- df %>% 
  group_by(date) %>% 
  summarize(flavors = list(flavor)) %>% 
  ungroup()

flavors
#> # A tibble: 3 × 2
#>   date       flavors  
#>   <date>     <list>   
#> 1 2022-01-01 <chr [3]>
#> 2 2022-02-01 <chr [3]>
#> 3 2022-03-01 <chr [4]>

# Find added flavors
added <- flavors %>%
  mutate(added = mapply(setdiff, flavors, lag(flavors)), .keep = "unused") %>% 
  unchop(added, keep_empty = TRUE) %>%
  pivot_longer(added, names_to = "type", values_to = "flavor")

# Find removed flavors
removed <- flavors %>%
  mutate(removed = mapply(setdiff, lag(flavors), flavors), .keep = "unused") %>% 
  unchop(removed, keep_empty = TRUE) %>%
  pivot_longer(removed, names_to = "type", values_to = "flavor")

added
#> # A tibble: 6 × 3
#>   date       type  flavor 
#>   <date>     <chr> <chr>  
#> 1 2022-01-01 added Almond 
#> 2 2022-01-01 added Apple  
#> 3 2022-01-01 added Apricot
#> 4 2022-02-01 added Maple  
#> 5 2022-02-01 added Mint   
#> 6 2022-03-01 added Pumpkin
removed
#> # A tibble: 4 × 3
#>   date       type    flavor 
#>   <date>     <chr>   <chr>  
#> 1 2022-01-01 removed <NA>   
#> 2 2022-02-01 removed Apple  
#> 3 2022-02-01 removed Apricot
#> 4 2022-03-01 removed <NA>

bind_rows(added, removed) %>%
  arrange(date, type)
#> # A tibble: 10 × 3
#>    date       type    flavor 
#>    <date>     <chr>   <chr>  
#>  1 2022-01-01 added   Almond 
#>  2 2022-01-01 added   Apple  
#>  3 2022-01-01 added   Apricot
#>  4 2022-01-01 removed <NA>   
#>  5 2022-02-01 added   Maple  
#>  6 2022-02-01 added   Mint   
#>  7 2022-02-01 removed Apple  
#>  8 2022-02-01 removed Apricot
#>  9 2022-03-01 added   Pumpkin
#> 10 2022-03-01 removed <NA>

Created on 2022-06-02 by the reprex package (v2.0.1)

  • Related