Home > Software engineering >  Pulling lagged data but only for a particular season in R
Pulling lagged data but only for a particular season in R

Time:11-10

I have a particular data set that includes two variables. One is numeric and the other is a character that identifies the season and year that the numeric data came from. This is what the head of the data looks like:

   SeasonYear  mean
   <chr>      <dbl>
 1 winter2000 0.957
 2 spring2000 0.943
 3 summer2000 1.03 
 4 fall2000   0.981
 5 winter2001 1.06 
 6 spring2001 1.05 
 7 summer2001 1.02 
 8 fall2001   1.03 
 9 winter2002 1.02 
10 spring2002 1.05 

Now I am looking to pull a lag of this data but only for the previous spring so that my data looks something like this:

SeasonYear  mean     lag
   <chr>      <dbl>  <dbl> 
 1 winter2000 0.957   NA
 2 spring2000 0.943   NA
 3 summer2000 1.03    0.943
 4 fall2000   0.981   0.943
 5 winter2001 1.06    0.943
 6 spring2001 1.05    0.943
 7 summer2001 1.02    1.05
 8 fall2001   1.03    1.05
 9 winter2002 1.02    1.05
10 spring2002 1.05    1.05

I am also looking to go back 2 springs as well so that my data looks something like this:

SeasonYear  mean     lag
   <chr>      <dbl>  <dbl> 
 1 winter2000 0.957   NA
 2 spring2000 0.943   NA
 3 summer2000 1.03    NA
 4 fall2000   0.981   NA
 5 winter2001 1.06    NA
 6 spring2001 1.05    NA
 7 summer2001 1.02    0.943
 8 fall2001   1.03    0.943
 9 winter2002 1.02    0.943
10 spring2002 1.05    0.943

I know I can use the lag() function to get previous data in the dataframe but I am looking for a way to specify a function that pulls a particular type of lag like the one I mentioned.

CodePudding user response:

One option to achieve your desired result may look like so:

  1. Split your SeasonYear into season and year
  2. Add a column with the value for the spring in each year
  3. Get the n-th lag taking into account that for fall and summer its the (n-1)-th lag
library(tidyr)
library(dplyr)

lag_spring <- function(x, y, n = 1) {
  data.frame(x = x, season_year = y) %>%
    tidyr::extract(season_year, into = c("season", "year"), regex = "^(. ?)(\\d{4})$") %>%
    group_by(year) %>%
    mutate(springmean = x[season == "spring"]) %>%
    ungroup() %>%
    group_by(season) %>%
    mutate(lag = ifelse(!season %in% c("summer", "fall"), lag(springmean, n = n), lag(springmean, n = n - 1))) %>%
    ungroup() %>%
    pull(lag)
}

dd %>%
  mutate(lag = lag_spring(mean, SeasonYear))
#>    SeasonYear  mean   lag
#> 1  winter2000 0.957    NA
#> 2  spring2000 0.943    NA
#> 3  summer2000 1.030 0.943
#> 4    fall2000 0.981 0.943
#> 5  winter2001 1.060 0.943
#> 6  spring2001 1.050 0.943
#> 7  summer2001 1.020 1.050
#> 8    fall2001 1.030 1.050
#> 9  winter2002 1.020 1.050
#> 10 spring2002 1.050 1.050

dd %>%
  mutate(lag = lag_spring(mean, SeasonYear, n = 2))
#>    SeasonYear  mean   lag
#> 1  winter2000 0.957    NA
#> 2  spring2000 0.943    NA
#> 3  summer2000 1.030    NA
#> 4    fall2000 0.981    NA
#> 5  winter2001 1.060    NA
#> 6  spring2001 1.050    NA
#> 7  summer2001 1.020 0.943
#> 8    fall2001 1.030 0.943
#> 9  winter2002 1.020 0.943
#> 10 spring2002 1.050 0.943

DATA

dd <- structure(list(SeasonYear = c(
  "winter2000", "spring2000", "summer2000",
  "fall2000", "winter2001", "spring2001", "summer2001", "fall2001",
  "winter2002", "spring2002"
), mean = c(
  0.957, 0.943, 1.03, 0.981,
  1.06, 1.05, 1.02, 1.03, 1.02, 1.05
)), class = "data.frame", row.names = c(
  "1",
  "2", "3", "4", "5", "6", "7", "8", "9", "10"
))
  • Related