Home > Software design >  Split rows yearly, based on weekly date columns
Split rows yearly, based on weekly date columns

Time:12-28

I am trying to split the values based on dates in r. Here is a dummy of my data:

df = data.frame(ID= c('A','B','C'),
            year = c('2019','2019','2020'),
            start= c('201850','201940','201850'),
            end= c('201903','202002','202110'),
            value = c(45,14,117))

 ID    year    start    end   value
  A    2019    201850  201903   45   
  B    2019    201940  202002   14 
  C    2020    201850  202110   117 

What I need in the output is the value divided on year-week level. Only for years after 2019 with the assumptrion that value is distributed evenly over weeks. For instance the first row of the dummy data (A) will be only 3 weeks in 2019. Since the value for A is for 5 weeks(201850 to 201903), for 3 weeks in 2019 it will be equal to 27().

Desired output is:

 ID    year    start    end   value
  A    2019    201901  201903   27   
  B    2019    201940  201952   12 
  B    2020    202001  202002   2 
  C    2019    201901  201952   52 
  C    2020    202001  202053   53 
  C    2021    202101  202110   10 

CodePudding user response:

We could write a function that works on every row together with dplyr::summarise. It is quite verbose, but it is working. You would need to specify how to round the output. One caveat is, that the function below assumes a year has 52 weeks so it does not yield exact values for years with 53 weeks.

df = data.frame(ID= c('A','B','C'),
                year = c('2019','2019','2020'),
                start= c('201850','201940','201850'),
                end= c('201903','202002','202110'),
                value = c(45,14,117))

library(tidyverse)
library(lubridate)

split_weeks <- function(start, end, value) {
  start_year <- as.numeric(str_extract(start, "^[0-9]{4}"))
  start_week <- as.numeric(str_extract(start, "[0-9]{2}$"))
  
  end_year <- as.numeric(str_extract(end, "^[0-9]{4}"))
  end_week <- as.numeric(str_extract(end, "[0-9]{2}$"))

  seq_year <- seq(start_year, end_year)
  
  ln_out <- length(seq_year)
  
  out_start <- vector("integer", length = ln_out)
  out_end <- vector("integer", length = ln_out)
  out_weight <- vector("integer", length = ln_out)
  
    for (i in seq_len(ln_out)) {
      if (i == 1) {
        out_start[i] <- start_week
        out_end[i] <- if(ln_out > 1) 52L else end_week
        out_weight[i] <- out_end[i] - start_week 
      } else if (i == ln_out) {
        out_start[i] <- 1L
        out_end[i] <- end_week
        out_weight[i] <- out_end[i] - out_start[i]   1L
      } else {
        out_start[i] <- 1L
        out_end[i] <- 52L
        out_weight[i] <- out_end[i] - out_start[i]   1L
      }
    }
  
  out <- tibble(year = seq_year,
                start = out_start,
                end = out_end,
                weight = out_weight,
                value = value)

  out <- mutate(out,
                value = (value * weight / sum(weight)),
                across(c(start, end), ~paste0(year, str_pad(.x, 2, pad = "0")))
                )
  select(out, -weight)
}

df %>% 
  rowwise(ID) %>% 
  summarise(split_weeks(start, end, value)) %>% 
  filter(year != 2018)

#> `summarise()` has grouped output by 'ID'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 x 5
#> # Groups:   ID [3]
#>   ID     year start  end    value
#>   <chr> <int> <chr>  <chr>  <dbl>
#> 1 A      2019 201901 201903  27  
#> 2 B      2019 201940 201952  12  
#> 3 B      2020 202001 202002   2  
#> 4 C      2019 201901 201952  52.4
#> 5 C      2020 202001 202052  52.4
#> 6 C      2021 202101 202110  10.1

Created on 2021-12-27 by the reprex package (v0.3.0)

  • Related