I am relatively new to R and would appreciate any help on this topic.
I have two datasets. One contains a list of every zip code in the US (DATASET 1). The other dataset contains the product, the product availability, the zip code it was sold in, and the year it was sold (DATASET 2).
For the 2020 data, the zip code will show up even if the toy was not sold in that zip code. But for 2021, if the toy is not offered in that zip code, it doesn’t show up in the dataset at all. Example of DATASET 2:
ZIP PRODUCT AVAILABILITY YEAR
10000 TOY YES 2020
10001 TOY NO 2020
10002 TOY YES 2020
10000 TOY YES 2021
10002 TOY YES 2021
Basically, in the above table you can see that zip code 10001 does not show up in 2021 since the toy was not sold that year.
With the two datasets I have available, I want to add rows to DATASET 2 such that if the toy was not sold in 2021, it will add a row like:
ZIP PRODUCT AVAILABILITY YEAR
10001 TOY NO 2021
So far, I am thinking of doing something along the lines of
IF YEAR(2021) AND (PRODUCT does not exist in given ZIP)
THEN [AVAILABILITY] = "NO"
However, I am completely stumped on the syntax and how to manipulate the two datasets such that DATASET 2 can be cross-checked against DATASET 1 for the full list of zip codes.
Again, any advice you can provide is greatly appreciated. Thank you!
CodePudding user response:
I heavily recommend you look into tsibble when dealing with dates
library(tidyverse)
library(tsibble)
#>
#> Attaching package: 'tsibble'
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, union
data_exemple_1 <- tibble::tribble(
~ZIP, ~PRODUCT, ~AVAILABILITY, ~YEAR,
10000L, "TOY", "YES", 2020L,
10001L, "TOY", "NO", 2020L,
10002L, "TOY", "YES", 2020L,
10000L, "TOY", "YES", 2021L,
10002L, "TOY", "YES", 2021L
)
data_exemple_1 |>
tsibble(index = YEAR,key = c(ZIP,PRODUCT)) |>
tsibble::fill_gaps(AVAILABILITY = 'NO',.full = TRUE)
#> # A tsibble: 6 x 4 [1Y]
#> # Key: ZIP, PRODUCT [3]
#> ZIP PRODUCT AVAILABILITY YEAR
#> <int> <chr> <chr> <dbl>
#> 1 10000 TOY YES 2020
#> 2 10000 TOY YES 2021
#> 3 10001 TOY NO 2020
#> 4 10001 TOY NO 2021
#> 5 10002 TOY YES 2020
#> 6 10002 TOY YES 2021
Created on 2021-10-18 by the reprex package (v2.0.1)
CodePudding user response:
You can use tidyr::complete
-
tidyr::complete(df, ZIP, PRODUCT, YEAR, fill = list(AVAILABILITY = 'NO'))
# ZIP PRODUCT YEAR AVAILABILITY
# <int> <chr> <int> <chr>
#1 10000 TOY 2020 YES
#2 10000 TOY 2021 YES
#3 10001 TOY 2020 NO
#4 10001 TOY 2021 NO
#5 10002 TOY 2020 YES
#6 10002 TOY 2021 YES