If i have the following dataframe:
tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END",
"2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2011END","2012END"),
website = c(
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"youtube",
"youtube"
),
values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30)
)
And would like to find the lag of the values so that i produce the following dataframe:
tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END",
"2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2011END","2012END"),
website = c(
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"youtube",
"youtube"
),
values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30),
output = c(NA, 1,NA,NA,NA,NA,NA, 5,5,5,5,5, NA, 10)
)
The periods are different - in that one period to lag by is clearly 5 - those going Q1,Q2,Q3,Q4,END and then a secondary lag is by period 2 - those going year_priorEND vs year_aheadEND and maybe even further.
Alternatively:
Instead would it be easier to just impute rows for these missing dates - for example making a condition that says, if there is no website which has a period of 5 (meaning that 5 values exist Q1,Q2,Q3,Q4,END) then generate the remainder of the rows for that website and period but have the values as NA, so something like this can be generated instead:
tibble(
period = c("2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END"),
website = c(
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube"
),
values = c(NA,NA,NA,NA,1, NA,NA,NA,NA,2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, NA,NA,NA,NA,20, NA,NA,NA,NA,30),
output = c(NA,NA,NA,NA,NA,NA,NA,NA,NA, 1,NA,NA,NA,NA,NA, 5,5,5,5,5, NA,NA,NA,NA,NA,NA,NA,NA,NA, 10)
)
So without explicitly coding out which fields need to imputed - i assume some form of checking can occur per group?
Because in this case we can just use output = lag(values, 5)
as the periods are all consistent
CodePudding user response:
I feel that the output for youtube should be 20? If so, this is a possible solution, however janky:
library(tidyverse)
df <- tibble( period = c("2010END", "2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END",
"2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2011END","2012END"),
website = c( "google", "google", "facebook",
"facebook", "facebook", "facebook",
"facebook", "facebook", "facebook",
"facebook", "facebook", "facebook",
"youtube", "youtube" ),
values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30) )
df %>%
extract(period, into = c("Year", "Period"), regex = "([0-9]*)(.*)") %>%
group_by(website) %>%
mutate(group_rank = match(Year, unique(Year)),
intstep = if_else(lag(Period) == "END" & group_rank == 2, lag(values), 0)) %>%
group_by(website, group_rank) %>%
mutate(outcome = if_else(group_rank == 2, first(intstep), as.double(NA))) %>%
ungroup() %>%
select(-c(group_rank,intstep))
#> # A tibble: 14 × 5
#> Year Period website values outcome
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2010 END google 1 NA
#> 2 2011 END google 2 1
#> 3 2010 Q1 facebook 1 NA
#> 4 2010 Q2 facebook 2 NA
#> 5 2010 Q3 facebook 3 NA
#> 6 2010 Q4 facebook 4 NA
#> 7 2010 END facebook 5 NA
#> 8 2011 Q1 facebook 6 5
#> 9 2011 Q2 facebook 7 5
#> 10 2011 Q3 facebook 8 5
#> 11 2011 Q4 facebook 9 5
#> 12 2011 END facebook 10 5
#> 13 2011 END youtube 20 NA
#> 14 2012 END youtube 30 20
The pipeline for this can probably be a lot neater and simpler, I am open to suggestions.
CodePudding user response:
If I understand correctly, the OP wants a calculate the year-on-year differences of the values of a period and the same period in the prior year for each website. There is a related question Calculating Lag for dataframe in R? where the OP explicitely asks to calculate 2011Q1 - 2010Q1 and so on including 2011END - 2010END.
Using lag()
will only work if the time sequences are complete and the number of positions to lag is always constant. This is not the case for the given dataset.
Therefore, I suggest to use an update self-join:
library(data.table)
setDT(inp)[, c("year", "qtr") := tstrsplit(period, "(?<=^\\d{4})", perl = TRUE,
type.convert = TRUE)][
, prior_year := year - 1L]
inp[inp, on = .(prior_year = year, qtr, website), output := x.values - i.values][]
period website values year qtr prior_year output 1: 2010END google 1 2010 END 2009 NA 2: 2011END google 2 2011 END 2010 1 3: 2010Q1 facebook 1 2010 Q1 2009 NA 4: 2010Q2 facebook 2 2010 Q2 2009 NA 5: 2010Q3 facebook 3 2010 Q3 2009 NA 6: 2010Q4 facebook 4 2010 Q4 2009 NA 7: 2010END facebook 5 2010 END 2009 NA 8: 2011Q1 facebook 6 2011 Q1 2010 5 9: 2011Q2 facebook 7 2011 Q2 2010 5 10: 2011Q3 facebook 8 2011 Q3 2010 5 11: 2011Q4 facebook 9 2011 Q4 2010 5 12: 2011END facebook 10 2011 END 2010 5 13: 2011END youtube 20 2011 END 2010 NA 14: 2012END youtube 30 2012 END 2011 10
Explanation
period
is split in two parts,year
andqtr
,year
is coerced to typeinteger
.prior_year
is computed adn appended to the dataset.- In the update self-join, for the same
website
andqtr
year
is matched withprior_year
. The difference in value is calcaluated and appended as new columnoutput
.