I am trying to split the values based on dates in r. Here is a dummy of my data:
df = data.frame(ID= c('A','B','C'),
year = c('2019','2019','2020'),
start= c('201850','201940','201850'),
end= c('201903','202002','202110'),
value = c(45,14,117))
ID year start end value
A 2019 201850 201903 45
B 2019 201940 202002 14
C 2020 201850 202110 117
What I need in the output is the value divided on year-week level. Only for years after 2019 with the assumptrion that value is distributed evenly over weeks. For instance the first row of the dummy data (A) will be only 3 weeks in 2019. Since the value for A is for 5 weeks(201850 to 201903), for 3 weeks in 2019 it will be equal to 27().
Desired output is:
ID year start end value
A 2019 201901 201903 27
B 2019 201940 201952 12
B 2020 202001 202002 2
C 2019 201901 201952 52
C 2020 202001 202053 53
C 2021 202101 202110 10
CodePudding user response:
We could write a function that works on every row together with dplyr::summarise
. It is quite verbose, but it is working. You would need to specify how to round the output. One caveat is, that the function below assumes a year has 52 weeks so it does not yield exact values for years with 53 weeks.
df = data.frame(ID= c('A','B','C'),
year = c('2019','2019','2020'),
start= c('201850','201940','201850'),
end= c('201903','202002','202110'),
value = c(45,14,117))
library(tidyverse)
library(lubridate)
split_weeks <- function(start, end, value) {
start_year <- as.numeric(str_extract(start, "^[0-9]{4}"))
start_week <- as.numeric(str_extract(start, "[0-9]{2}$"))
end_year <- as.numeric(str_extract(end, "^[0-9]{4}"))
end_week <- as.numeric(str_extract(end, "[0-9]{2}$"))
seq_year <- seq(start_year, end_year)
ln_out <- length(seq_year)
out_start <- vector("integer", length = ln_out)
out_end <- vector("integer", length = ln_out)
out_weight <- vector("integer", length = ln_out)
for (i in seq_len(ln_out)) {
if (i == 1) {
out_start[i] <- start_week
out_end[i] <- if(ln_out > 1) 52L else end_week
out_weight[i] <- out_end[i] - start_week
} else if (i == ln_out) {
out_start[i] <- 1L
out_end[i] <- end_week
out_weight[i] <- out_end[i] - out_start[i] 1L
} else {
out_start[i] <- 1L
out_end[i] <- 52L
out_weight[i] <- out_end[i] - out_start[i] 1L
}
}
out <- tibble(year = seq_year,
start = out_start,
end = out_end,
weight = out_weight,
value = value)
out <- mutate(out,
value = (value * weight / sum(weight)),
across(c(start, end), ~paste0(year, str_pad(.x, 2, pad = "0")))
)
select(out, -weight)
}
df %>%
rowwise(ID) %>%
summarise(split_weeks(start, end, value)) %>%
filter(year != 2018)
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 x 5
#> # Groups: ID [3]
#> ID year start end value
#> <chr> <int> <chr> <chr> <dbl>
#> 1 A 2019 201901 201903 27
#> 2 B 2019 201940 201952 12
#> 3 B 2020 202001 202002 2
#> 4 C 2019 201901 201952 52.4
#> 5 C 2020 202001 202052 52.4
#> 6 C 2021 202101 202110 10.1
Created on 2021-12-27 by the reprex package (v0.3.0)