I'm working on a somewhat complex problem for my job. For purposes of explaining the problem, I am simplifying the data into something not real world, but will explain the overall concept.
Problem: I have a set of data by account. For each account, an inventory is performed at different intervals to see what items exist. Normally, the items should be exactly the same from start date, through each inventory date.
We are looking to see, for each account, which items changed from the start date. Here is the data:
Account | Inventory_Date | Item |
---|---|---|
1278 | 1/1/2020 | Apple |
1278 | 1/1/2020 | Desk |
1278 | 1/1/2020 | Pear |
1278 | 3/10/2021 | Apple |
1278 | 3/10/2021 | Desk |
1278 | 3/10/2021 | Pear |
1278 | 3/10/2021 | Grapes |
1278 | 4/15/2021 | Apple |
1278 | 4/15/2021 | Pear |
1278 | 4/15/2021 | Grapes |
1239 | 6/11/2019 | Pencils |
1239 | 6/11/2019 | Harness |
1239 | 6/11/2019 | Toothbrush |
1239 | 7/10/2021 | Pencils |
1239 | 7/10/2021 | Harness |
1239 | 7/10/2021 | Toothbrush |
1297 | 12/20/2018 | Apple |
1297 | 1/15/2019 | Grapes |
1345 | 3/19/2016 | Chicken |
1345 | 3/19/2016 | Steak |
1345 | 4/11/2017 | Chicken |
1345 | 4/11/2017 | Steak |
1345 | 4/11/2017 | Ribs |
For Account 1278, we started with an Apple, Desk and Pear. On 3/10, we can see that apple, pear and desk still exist, but Grapes were added. Then on 4/15, we can see that Apple, Pear and Grapes exist, so Desk was removed.
So expected result: Grapes added, Desk removed. That's it. Just looking to see which items were added/removed from the earliest start date.
For account 1239 - expected result: The same items exist. No change to report on this one.
For account 1297 - expected result: Apple changed to Grapes.
For account 1345 - expected result: Ribs were added.
I'm at a loss as to how to tackle this. I tried:
library(tidyverse)
library(readxl)
library(openxlsx)
sampledata <- read_excel("C:/SampleCCData.xlsx")
sampledata %>%
gather(., account, )
But not sure how to proceed - is gather and spread a correct approach?
Any help is much appreciated! Thanks!
CodePudding user response:
Here is one approach using the tidyverse:
library(dplyr)
library(tidyr)
mydat <- mydat %>%
mutate(Inventory_Date = lubridate::mdy(Inventory_Date))
# Create initial data for later join
init_dat <- mydat %>%
group_by(Account, Inventory_Date) %>%
summarise(Init_Items = list(unique(Item))) %>%
group_by(Account) %>%
filter(Inventory_Date == min(Inventory_Date)) %>%
select(-Inventory_Date)
#> `summarise()` has grouped output by 'Account'. You can override using the
#> `.groups` argument.
mydat %>%
# filter all rows expect the first date / initial data
group_by(Account) %>%
filter(Inventory_Date != min(Inventory_Date)) %>%
select(-Inventory_Date) %>%
# sum Items in list()
group_by(Account) %>%
summarise(Items = list(unique(Item))) %>%
# join initial data
left_join(init_dat, by = "Account") %>%
rowwise() %>%
# check if items were removed / added
mutate(added = list(setdiff(Items, Init_Items)),
removed = list(setdiff(Init_Items, Items))) %>%
select(!ends_with("Items")) %>%
# unnest/ clean
unnest(added, keep_empty = TRUE) %>%
unnest(removed, keep_empty = TRUE)
#> # A tibble: 4 x 3
#> Account added removed
#> <int> <chr> <chr>
#> 1 1239 <NA> <NA>
#> 2 1278 Grapes <NA>
#> 3 1297 Grapes Apple
#> 4 1345 Ribs <NA>
Created on 2021-12-28 by the reprex package (v0.3.0)
# data
mydat <- data.frame(
stringsAsFactors = FALSE,
Account = c(1278L,1278L,1278L,1278L,
1278L,1278L,1278L,1278L,1278L,1278L,1239L,1239L,
1239L,1239L,1239L,1239L,1297L,1297L,1345L,1345L,1345L,
1345L,1345L),
Inventory_Date = c("1/1/2020","1/1/2020",
"1/1/2020","3/10/2021","3/10/2021","3/10/2021","3/10/2021",
"4/15/2021","4/15/2021","4/15/2021","6/11/2019",
"6/11/2019","6/11/2019","7/10/2021","7/10/2021",
"7/10/2021","12/20/2018","1/15/2019","3/19/2016","3/19/2016",
"4/11/2017","4/11/2017","4/11/2017"),
Item = c("Apple","Desk","Pear",
"Apple","Desk","Pear","Grapes","Apple","Pear","Grapes",
"Pencils","Harness","Toothbrush","Pencils","Harness",
"Toothbrush","Apple","Grapes","Chicken","Steak",
"Chicken","Steak","Ribs")
)
CodePudding user response:
Here is a dplyr
solution. The key idea is: those added are the ones not observed in the last record, and those removed are the ones not observed in the current record.
library(dplyr)
vsetdiff <- Vectorize(setdiff, c("x", "y"))
sampledata %>%
mutate(Inventory_Date = as.Date(Inventory_Date, "%m/%d/%Y")) %>%
group_by(Account, Inventory_Date) %>%
summarise(Item = list(Item), .groups = "drop_last") %>%
mutate(last_item = lag(Item, 1L, Item[1L], Inventory_Date)) %>%
summarise(
added = toString(unlist(vsetdiff(Item, last_item))),
removed = toString(unlist(vsetdiff(last_item, Item))),
)
Results
# A tibble: 4 x 3
Account added removed
<int> <chr> <chr>
1 1239 "" ""
2 1278 "Grapes" "Desk"
3 1297 "Grapes" "Apple"
4 1345 "Ribs" ""
Data
sampledata <- structure(list(Account = c(1278L, 1278L, 1278L, 1278L, 1278L,
1278L, 1278L, 1278L, 1278L, 1278L, 1239L, 1239L, 1239L, 1239L,
1239L, 1239L, 1297L, 1297L, 1345L, 1345L, 1345L, 1345L, 1345L
), Inventory_Date = c("1/1/2020", "1/1/2020", "1/1/2020", "3/10/2021",
"3/10/2021", "3/10/2021", "3/10/2021", "4/15/2021", "4/15/2021",
"4/15/2021", "6/11/2019", "6/11/2019", "6/11/2019", "7/10/2021",
"7/10/2021", "7/10/2021", "12/20/2018", "1/15/2019", "3/19/2016",
"3/19/2016", "4/11/2017", "4/11/2017", "4/11/2017"), Item = c("Apple",
"Desk", "Pear", "Apple", "Desk", "Pear", "Grapes", "Apple", "Pear",
"Grapes", "Pencils", "Harness", "Toothbrush", "Pencils", "Harness",
"Toothbrush", "Apple", "Grapes", "Chicken", "Steak", "Chicken",
"Steak", "Ribs")), row.names = c(NA, -23L), class = c("tbl_df",
"tbl", "data.frame"))