Home > OS >  Creating rows for data that does not exist in R
Creating rows for data that does not exist in R

Time:10-19

I am relatively new to R and would appreciate any help on this topic.

I have two datasets. One contains a list of every zip code in the US (DATASET 1). The other dataset contains the product, the product availability, the zip code it was sold in, and the year it was sold (DATASET 2).

For the 2020 data, the zip code will show up even if the toy was not sold in that zip code. But for 2021, if the toy is not offered in that zip code, it doesn’t show up in the dataset at all. Example of DATASET 2:

ZIP    PRODUCT     AVAILABILITY  YEAR 
10000   TOY        YES          2020        
10001   TOY        NO           2020     
10002   TOY        YES          2020        
10000   TOY        YES          2021        
10002   TOY        YES          2021      

Basically, in the above table you can see that zip code 10001 does not show up in 2021 since the toy was not sold that year.

With the two datasets I have available, I want to add rows to DATASET 2 such that if the toy was not sold in 2021, it will add a row like:

ZIP    PRODUCT     AVAILABILITY  YEAR 
10001   TOY        NO            2021 

So far, I am thinking of doing something along the lines of

IF YEAR(2021) AND (PRODUCT does not exist in given ZIP)
THEN [AVAILABILITY] = "NO"

However, I am completely stumped on the syntax and how to manipulate the two datasets such that DATASET 2 can be cross-checked against DATASET 1 for the full list of zip codes.

Again, any advice you can provide is greatly appreciated. Thank you!

CodePudding user response:

I heavily recommend you look into tsibble when dealing with dates

library(tidyverse)
library(tsibble)
#> 
#> Attaching package: 'tsibble'
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, union


data_exemple_1 <- tibble::tribble(
    ~ZIP, ~PRODUCT, ~AVAILABILITY, ~YEAR,
  10000L,    "TOY",         "YES", 2020L,
  10001L,    "TOY",          "NO", 2020L,
  10002L,    "TOY",         "YES", 2020L,
  10000L,    "TOY",         "YES", 2021L,
  10002L,    "TOY",         "YES", 2021L
  )



data_exemple_1 |>
  tsibble(index = YEAR,key = c(ZIP,PRODUCT)) |> 
  tsibble::fill_gaps(AVAILABILITY = 'NO',.full = TRUE) 
#> # A tsibble: 6 x 4 [1Y]
#> # Key:       ZIP, PRODUCT [3]
#>     ZIP PRODUCT AVAILABILITY  YEAR
#>   <int> <chr>   <chr>        <dbl>
#> 1 10000 TOY     YES           2020
#> 2 10000 TOY     YES           2021
#> 3 10001 TOY     NO            2020
#> 4 10001 TOY     NO            2021
#> 5 10002 TOY     YES           2020
#> 6 10002 TOY     YES           2021

Created on 2021-10-18 by the reprex package (v2.0.1)

CodePudding user response:

You can use tidyr::complete -

tidyr::complete(df, ZIP, PRODUCT, YEAR, fill = list(AVAILABILITY = 'NO'))

#   ZIP PRODUCT  YEAR AVAILABILITY
#  <int> <chr>   <int> <chr>       
#1 10000 TOY      2020 YES         
#2 10000 TOY      2021 YES         
#3 10001 TOY      2020 NO          
#4 10001 TOY      2021 NO          
#5 10002 TOY      2020 YES         
#6 10002 TOY      2021 YES         
  •  Tags:  
  • r
  • Related