Home > Enterprise >  How can I sift through a list to find elements that changed?
How can I sift through a list to find elements that changed?

Time:12-29

I'm working on a somewhat complex problem for my job. For purposes of explaining the problem, I am simplifying the data into something not real world, but will explain the overall concept.

Problem: I have a set of data by account. For each account, an inventory is performed at different intervals to see what items exist. Normally, the items should be exactly the same from start date, through each inventory date.

We are looking to see, for each account, which items changed from the start date. Here is the data:

Account Inventory_Date Item
1278 1/1/2020 Apple
1278 1/1/2020 Desk
1278 1/1/2020 Pear
1278 3/10/2021 Apple
1278 3/10/2021 Desk
1278 3/10/2021 Pear
1278 3/10/2021 Grapes
1278 4/15/2021 Apple
1278 4/15/2021 Pear
1278 4/15/2021 Grapes
1239 6/11/2019 Pencils
1239 6/11/2019 Harness
1239 6/11/2019 Toothbrush
1239 7/10/2021 Pencils
1239 7/10/2021 Harness
1239 7/10/2021 Toothbrush
1297 12/20/2018 Apple
1297 1/15/2019 Grapes
1345 3/19/2016 Chicken
1345 3/19/2016 Steak
1345 4/11/2017 Chicken
1345 4/11/2017 Steak
1345 4/11/2017 Ribs

For Account 1278, we started with an Apple, Desk and Pear. On 3/10, we can see that apple, pear and desk still exist, but Grapes were added. Then on 4/15, we can see that Apple, Pear and Grapes exist, so Desk was removed.

So expected result: Grapes added, Desk removed. That's it. Just looking to see which items were added/removed from the earliest start date.

For account 1239 - expected result: The same items exist. No change to report on this one.

For account 1297 - expected result: Apple changed to Grapes.

For account 1345 - expected result: Ribs were added.

I'm at a loss as to how to tackle this. I tried:

library(tidyverse)
library(readxl)
library(openxlsx)

sampledata <- read_excel("C:/SampleCCData.xlsx") 


sampledata %>% 
  gather(., account, )

But not sure how to proceed - is gather and spread a correct approach?

Any help is much appreciated! Thanks!

CodePudding user response:

Here is one approach using the tidyverse:

library(dplyr)
library(tidyr)

mydat <- mydat %>% 
  mutate(Inventory_Date = lubridate::mdy(Inventory_Date))

# Create initial data for later join
init_dat <- mydat %>%
  group_by(Account, Inventory_Date) %>% 
  summarise(Init_Items = list(unique(Item))) %>% 
  group_by(Account) %>% 
  filter(Inventory_Date == min(Inventory_Date)) %>% 
  select(-Inventory_Date)
#> `summarise()` has grouped output by 'Account'. You can override using the
#> `.groups` argument.


mydat %>% 
  # filter all rows expect the first date / initial data
  group_by(Account) %>% 
  filter(Inventory_Date != min(Inventory_Date)) %>% 
  select(-Inventory_Date) %>% 
  # sum Items in list()
  group_by(Account) %>% 
  summarise(Items = list(unique(Item))) %>% 
  # join initial data
  left_join(init_dat, by = "Account") %>% 
  rowwise() %>% 
  # check if items were removed / added
  mutate(added = list(setdiff(Items, Init_Items)),
         removed = list(setdiff(Init_Items, Items))) %>% 
  select(!ends_with("Items")) %>% 
  # unnest/  clean
  unnest(added, keep_empty = TRUE) %>% 
  unnest(removed, keep_empty = TRUE)  

#> # A tibble: 4 x 3
#>   Account added  removed
#>     <int> <chr>  <chr>  
#> 1    1239 <NA>   <NA>   
#> 2    1278 Grapes <NA>   
#> 3    1297 Grapes Apple  
#> 4    1345 Ribs   <NA>

Created on 2021-12-28 by the reprex package (v0.3.0)

# data

mydat <- data.frame(
  stringsAsFactors = FALSE,
  Account = c(1278L,1278L,1278L,1278L,
              1278L,1278L,1278L,1278L,1278L,1278L,1239L,1239L,
              1239L,1239L,1239L,1239L,1297L,1297L,1345L,1345L,1345L,
              1345L,1345L),
  Inventory_Date = c("1/1/2020","1/1/2020",
                     "1/1/2020","3/10/2021","3/10/2021","3/10/2021","3/10/2021",
                     "4/15/2021","4/15/2021","4/15/2021","6/11/2019",
                     "6/11/2019","6/11/2019","7/10/2021","7/10/2021",
                     "7/10/2021","12/20/2018","1/15/2019","3/19/2016","3/19/2016",
                     "4/11/2017","4/11/2017","4/11/2017"),
  Item = c("Apple","Desk","Pear",
           "Apple","Desk","Pear","Grapes","Apple","Pear","Grapes",
           "Pencils","Harness","Toothbrush","Pencils","Harness",
           "Toothbrush","Apple","Grapes","Chicken","Steak",
           "Chicken","Steak","Ribs")
)

CodePudding user response:

Here is a dplyr solution. The key idea is: those added are the ones not observed in the last record, and those removed are the ones not observed in the current record.

library(dplyr)

vsetdiff <- Vectorize(setdiff, c("x", "y"))

sampledata %>% 
  mutate(Inventory_Date = as.Date(Inventory_Date, "%m/%d/%Y")) %>% 
  group_by(Account, Inventory_Date) %>% 
  summarise(Item = list(Item), .groups = "drop_last") %>% 
  mutate(last_item = lag(Item, 1L, Item[1L], Inventory_Date)) %>% 
  summarise(
    added = toString(unlist(vsetdiff(Item, last_item))), 
    removed = toString(unlist(vsetdiff(last_item, Item))), 
  )

Results

# A tibble: 4 x 3
  Account added    removed
    <int> <chr>    <chr>  
1    1239 ""       ""     
2    1278 "Grapes" "Desk" 
3    1297 "Grapes" "Apple"
4    1345 "Ribs"   ""   

Data

sampledata <- structure(list(Account = c(1278L, 1278L, 1278L, 1278L, 1278L, 
1278L, 1278L, 1278L, 1278L, 1278L, 1239L, 1239L, 1239L, 1239L, 
1239L, 1239L, 1297L, 1297L, 1345L, 1345L, 1345L, 1345L, 1345L
), Inventory_Date = c("1/1/2020", "1/1/2020", "1/1/2020", "3/10/2021", 
"3/10/2021", "3/10/2021", "3/10/2021", "4/15/2021", "4/15/2021", 
"4/15/2021", "6/11/2019", "6/11/2019", "6/11/2019", "7/10/2021", 
"7/10/2021", "7/10/2021", "12/20/2018", "1/15/2019", "3/19/2016", 
"3/19/2016", "4/11/2017", "4/11/2017", "4/11/2017"), Item = c("Apple", 
"Desk", "Pear", "Apple", "Desk", "Pear", "Grapes", "Apple", "Pear", 
"Grapes", "Pencils", "Harness", "Toothbrush", "Pencils", "Harness", 
"Toothbrush", "Apple", "Grapes", "Chicken", "Steak", "Chicken", 
"Steak", "Ribs")), row.names = c(NA, -23L), class = c("tbl_df", 
"tbl", "data.frame"))
  • Related