How to calculate LAG difference in R - if periods changes or how to add empty NA rows to dataframe?-CodePudding

If i have the following dataframe:

tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube"
  ),
  values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30)
)

And would like to find the lag of the values so that i produce the following dataframe:

tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube"
  ),
  values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30), 
  output = c(NA, 1,NA,NA,NA,NA,NA, 5,5,5,5,5, NA, 10)
)

The periods are different - in that one period to lag by is clearly 5 - those going Q1,Q2,Q3,Q4,END and then a secondary lag is by period 2 - those going year_priorEND vs year_aheadEND and maybe even further.

Alternatively:

Instead would it be easier to just impute rows for these missing dates - for example making a condition that says, if there is no website which has a period of 5 (meaning that 5 values exist Q1,Q2,Q3,Q4,END) then generate the remainder of the rows for that website and period but have the values as NA, so something like this can be generated instead:

tibble(
  period = c("2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END"),
  website = c(
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube"
  ),
  values = c(NA,NA,NA,NA,1, NA,NA,NA,NA,2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, NA,NA,NA,NA,20, NA,NA,NA,NA,30), 
  output = c(NA,NA,NA,NA,NA,NA,NA,NA,NA, 1,NA,NA,NA,NA,NA, 5,5,5,5,5, NA,NA,NA,NA,NA,NA,NA,NA,NA, 10)
)

So without explicitly coding out which fields need to imputed - i assume some form of checking can occur per group? Because in this case we can just use output = lag(values, 5) as the periods are all consistent

CodePudding user response：

I feel that the output for youtube should be 20? If so, this is a possible solution, however janky:

library(tidyverse)
df <- tibble( period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
             website = c( "google", "google", "facebook",
                          "facebook", "facebook", "facebook",
                          "facebook", "facebook", "facebook",
                          "facebook", "facebook", "facebook",
                          "youtube", "youtube" ),
             values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30) )

df %>% 
  extract(period, into = c("Year", "Period"), regex = "([0-9]*)(.*)") %>% 
  group_by(website) %>% 
  mutate(group_rank = match(Year, unique(Year)),
         intstep = if_else(lag(Period) == "END" & group_rank == 2, lag(values), 0)) %>% 
  group_by(website, group_rank) %>% 
  mutate(outcome = if_else(group_rank == 2, first(intstep), as.double(NA))) %>% 
  ungroup() %>% 
  select(-c(group_rank,intstep))
#> # A tibble: 14 × 5
#>    Year  Period website  values outcome
#>    <chr> <chr>  <chr>     <dbl>   <dbl>
#>  1 2010  END    google        1      NA
#>  2 2011  END    google        2       1
#>  3 2010  Q1     facebook      1      NA
#>  4 2010  Q2     facebook      2      NA
#>  5 2010  Q3     facebook      3      NA
#>  6 2010  Q4     facebook      4      NA
#>  7 2010  END    facebook      5      NA
#>  8 2011  Q1     facebook      6       5
#>  9 2011  Q2     facebook      7       5
#> 10 2011  Q3     facebook      8       5
#> 11 2011  Q4     facebook      9       5
#> 12 2011  END    facebook     10       5
#> 13 2011  END    youtube      20      NA
#> 14 2012  END    youtube      30      20

The pipeline for this can probably be a lot neater and simpler, I am open to suggestions.

CodePudding user response：

If I understand correctly, the OP wants a calculate the year-on-year differences of the values of a period and the same period in the prior year for each website. There is a related question Calculating Lag for dataframe in R? where the OP explicitely asks to calculate 2011Q1 - 2010Q1 and so on including 2011END - 2010END.

Using lag() will only work if the time sequences are complete and the number of positions to lag is always constant. This is not the case for the given dataset.

Therefore, I suggest to use an update self-join:

library(data.table)
setDT(inp)[, c("year", "qtr") := tstrsplit(period, "(?<=^\\d{4})", perl = TRUE, 
                                           type.convert = TRUE)][
                                             , prior_year := year - 1L]
inp[inp, on = .(prior_year = year, qtr, website), output := x.values - i.values][]

     period  website values year qtr prior_year output
 1: 2010END   google      1 2010 END       2009     NA
 2: 2011END   google      2 2011 END       2010      1
 3:  2010Q1 facebook      1 2010  Q1       2009     NA
 4:  2010Q2 facebook      2 2010  Q2       2009     NA
 5:  2010Q3 facebook      3 2010  Q3       2009     NA
 6:  2010Q4 facebook      4 2010  Q4       2009     NA
 7: 2010END facebook      5 2010 END       2009     NA
 8:  2011Q1 facebook      6 2011  Q1       2010      5
 9:  2011Q2 facebook      7 2011  Q2       2010      5
10:  2011Q3 facebook      8 2011  Q3       2010      5
11:  2011Q4 facebook      9 2011  Q4       2010      5
12: 2011END facebook     10 2011 END       2010      5
13: 2011END  youtube     20 2011 END       2010     NA
14: 2012END  youtube     30 2012 END       2011     10

Explanation

period is split in two parts, year and qtr, year is coerced to type integer.
prior_year is computed adn appended to the dataset.
In the update self-join, for the same website and qtr year is matched with prior_year. The difference in value is calcaluated and appended as new column output.