How can I sift through a list to find elements that changed?-CodePudding

I'm working on a somewhat complex problem for my job. For purposes of explaining the problem, I am simplifying the data into something not real world, but will explain the overall concept.

Problem: I have a set of data by account. For each account, an inventory is performed at different intervals to see what items exist. Normally, the items should be exactly the same from start date, through each inventory date.

We are looking to see, for each account, which items changed from the start date. Here is the data:

Account	Inventory_Date	Item
1278	1/1/2020	Apple
1278	1/1/2020	Desk
1278	1/1/2020	Pear
1278	3/10/2021	Apple
1278	3/10/2021	Desk
1278	3/10/2021	Pear
1278	3/10/2021	Grapes
1278	4/15/2021	Apple
1278	4/15/2021	Pear
1278	4/15/2021	Grapes
1239	6/11/2019	Pencils
1239	6/11/2019	Harness
1239	6/11/2019	Toothbrush
1239	7/10/2021	Pencils
1239	7/10/2021	Harness
1239	7/10/2021	Toothbrush
1297	12/20/2018	Apple
1297	1/15/2019	Grapes
1345	3/19/2016	Chicken
1345	3/19/2016	Steak
1345	4/11/2017	Chicken
1345	4/11/2017	Steak
1345	4/11/2017	Ribs

For Account 1278, we started with an Apple, Desk and Pear. On 3/10, we can see that apple, pear and desk still exist, but Grapes were added. Then on 4/15, we can see that Apple, Pear and Grapes exist, so Desk was removed.

So expected result: Grapes added, Desk removed. That's it. Just looking to see which items were added/removed from the earliest start date.

For account 1239 - expected result: The same items exist. No change to report on this one.

For account 1297 - expected result: Apple changed to Grapes.

For account 1345 - expected result: Ribs were added.

I'm at a loss as to how to tackle this. I tried:

library(tidyverse)
library(readxl)
library(openxlsx)

sampledata <- read_excel("C:/SampleCCData.xlsx") 


sampledata %>% 
  gather(., account, )

But not sure how to proceed - is gather and spread a correct approach?

Any help is much appreciated! Thanks!

CodePudding user response：

Here is one approach using the tidyverse:

library(dplyr)
library(tidyr)

mydat <- mydat %>% 
  mutate(Inventory_Date = lubridate::mdy(Inventory_Date))

# Create initial data for later join
init_dat <- mydat %>%
  group_by(Account, Inventory_Date) %>% 
  summarise(Init_Items = list(unique(Item))) %>% 
  group_by(Account) %>% 
  filter(Inventory_Date == min(Inventory_Date)) %>% 
  select(-Inventory_Date)
#> `summarise()` has grouped output by 'Account'. You can override using the
#> `.groups` argument.


mydat %>% 
  # filter all rows expect the first date / initial data
  group_by(Account) %>% 
  filter(Inventory_Date != min(Inventory_Date)) %>% 
  select(-Inventory_Date) %>% 
  # sum Items in list()
  group_by(Account) %>% 
  summarise(Items = list(unique(Item))) %>% 
  # join initial data
  left_join(init_dat, by = "Account") %>% 
  rowwise() %>% 
  # check if items were removed / added
  mutate(added = list(setdiff(Items, Init_Items)),
         removed = list(setdiff(Init_Items, Items))) %>% 
  select(!ends_with("Items")) %>% 
  # unnest/  clean
  unnest(added, keep_empty = TRUE) %>% 
  unnest(removed, keep_empty = TRUE)  

#> # A tibble: 4 x 3
#>   Account added  removed
#>     <int> <chr>  <chr>  
#> 1    1239 <NA>   <NA>   
#> 2    1278 Grapes <NA>   
#> 3    1297 Grapes Apple  
#> 4    1345 Ribs   <NA>

^{Created on 2021-12-28 by the reprex package (v0.3.0)}

# data

mydat <- data.frame(
  stringsAsFactors = FALSE,
  Account = c(1278L,1278L,1278L,1278L,
              1278L,1278L,1278L,1278L,1278L,1278L,1239L,1239L,
              1239L,1239L,1239L,1239L,1297L,1297L,1345L,1345L,1345L,
              1345L,1345L),
  Inventory_Date = c("1/1/2020","1/1/2020",
                     "1/1/2020","3/10/2021","3/10/2021","3/10/2021","3/10/2021",
                     "4/15/2021","4/15/2021","4/15/2021","6/11/2019",
                     "6/11/2019","6/11/2019","7/10/2021","7/10/2021",
                     "7/10/2021","12/20/2018","1/15/2019","3/19/2016","3/19/2016",
                     "4/11/2017","4/11/2017","4/11/2017"),
  Item = c("Apple","Desk","Pear",
           "Apple","Desk","Pear","Grapes","Apple","Pear","Grapes",
           "Pencils","Harness","Toothbrush","Pencils","Harness",
           "Toothbrush","Apple","Grapes","Chicken","Steak",
           "Chicken","Steak","Ribs")
)

CodePudding user response：

Here is a dplyr solution. The key idea is: those added are the ones not observed in the last record, and those removed are the ones not observed in the current record.

library(dplyr)

vsetdiff <- Vectorize(setdiff, c("x", "y"))

sampledata %>% 
  mutate(Inventory_Date = as.Date(Inventory_Date, "%m/%d/%Y")) %>% 
  group_by(Account, Inventory_Date) %>% 
  summarise(Item = list(Item), .groups = "drop_last") %>% 
  mutate(last_item = lag(Item, 1L, Item[1L], Inventory_Date)) %>% 
  summarise(
    added = toString(unlist(vsetdiff(Item, last_item))), 
    removed = toString(unlist(vsetdiff(last_item, Item))), 
  )

Results

# A tibble: 4 x 3
  Account added    removed
    <int> <chr>    <chr>  
1    1239 ""       ""     
2    1278 "Grapes" "Desk" 
3    1297 "Grapes" "Apple"
4    1345 "Ribs"   ""

Data

sampledata <- structure(list(Account = c(1278L, 1278L, 1278L, 1278L, 1278L, 
1278L, 1278L, 1278L, 1278L, 1278L, 1239L, 1239L, 1239L, 1239L, 
1239L, 1239L, 1297L, 1297L, 1345L, 1345L, 1345L, 1345L, 1345L
), Inventory_Date = c("1/1/2020", "1/1/2020", "1/1/2020", "3/10/2021", 
"3/10/2021", "3/10/2021", "3/10/2021", "4/15/2021", "4/15/2021", 
"4/15/2021", "6/11/2019", "6/11/2019", "6/11/2019", "7/10/2021", 
"7/10/2021", "7/10/2021", "12/20/2018", "1/15/2019", "3/19/2016", 
"3/19/2016", "4/11/2017", "4/11/2017", "4/11/2017"), Item = c("Apple", 
"Desk", "Pear", "Apple", "Desk", "Pear", "Grapes", "Apple", "Pear", 
"Grapes", "Pencils", "Harness", "Toothbrush", "Pencils", "Harness", 
"Toothbrush", "Apple", "Grapes", "Chicken", "Steak", "Chicken", 
"Steak", "Ribs")), row.names = c(NA, -23L), class = c("tbl_df", 
"tbl", "data.frame"))