Time spent in each calendar year

I followed two individuals, A and B, each from a start date to an end date:

df<-data.frame(id=c("A", "B"), start=as.Date(c("2015-01-01", "2013-01-01")), end=as.Date(c("2021-06-12", "2017-10-10")))
df

  id      start        end
1  A 2015-01-01 2021-06-12
2  B 2013-01-01 2017-10-10

I would like to calculate the follow-up time contributed to each calendar year. For example, I have 1 year for 2013 (from B), 1 year for 2014 (from B), 2 years for 2015 (from A and B), and so on.

I tried to treat year as an integer and count how many years each individual contributes but due to rounding errors the result is not plausible.

I tried

years<-NULL

for (i in 1:length(df$id)){
  years<-c(years, as.character(as.Date(seq.Date(from = df$start[i], to = df$end[i], by = "day"))))
}
library(lubridate)
table(year(years))/365

 2013      2014      2015      2016      2017      2018      2019      2020      2021 
1.0000000 1.0000000 2.0000000 2.0054795 1.7753425 1.0000000 1.0000000 1.0027397 0.4465753

which is the answer I am trying to get, but it is computationally inefficient and very slow on large data. Is there any way to do this without the loop, or otherwise more efficiently?

CodePudding user response：

Sounds like a job for a great package called lubridate. See example:

By the way, I assumed the dates are in year-month-day order, hence ymd. If they are in the American month-day-year format, use mdy instead.

df<-data.frame(id=c("A", "B"), start=as.Date(c("2015-01-01", "2013-01-01")), end=as.Date(c("2021-06-12", "2017-10-10")))

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(tidyverse)
df %>%
    mutate(across(start:end, ymd),
           follow_up_years = interval(start, end)/years(1),
           follow_up_months = interval(start, end)/months(1),
           follow_up_days = interval(start, end)/days(1),
    )
#>   id      start        end follow_up_years follow_up_months follow_up_days
#> 1  A 2015-01-01 2021-06-12        6.443836         77.36667           2354
#> 2  B 2013-01-01 2017-10-10        4.772603         57.29032           1743

Created on 2021-10-28 by the reprex package (v2.0.1)

Edit

I think I understand. I guess we can also just use lubridate intervals:

df %>%
    mutate(follow_up_2015 = interval(start, as_date("2015-01-01"))/years(1)) %>%
    pull(follow_up_2015) %>%
    sum()
#> [1] 2

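Note that the snippet above measures the time from each start date up to 2015-01-01, so it only coincidentally sums to 2. If the goal is the time that falls inside a given calendar year, one option is to clip each follow-up window to that year first. The following is a minimal base R sketch of that idea, not the answerer's method: follow_up_in_year is a hypothetical helper name, and it assumes the end date counts as a followed day and that each year is divided by its own length (365 or 366 days).

```r
df <- data.frame(
  id = c("A", "B"),
  start = as.Date(c("2015-01-01", "2013-01-01")),
  end = as.Date(c("2021-06-12", "2017-10-10"))
)

# Hypothetical helper: fraction of calendar year `yr` covered by each
# follow-up window. Assumes the end date itself is a followed day.
follow_up_in_year <- function(start, end, yr) {
  year_start <- as.Date(paste0(yr, "-01-01"))
  next_year_start <- as.Date(paste0(yr + 1, "-01-01"))
  # Clip each window to [year_start, next_year_start)
  lo <- pmax(start, year_start)
  hi <- pmin(end + 1, next_year_start)
  # Windows that miss the year entirely give a negative difference; floor at 0
  days_in_year <- pmax(as.numeric(hi - lo), 0)
  days_in_year / as.numeric(next_year_start - year_start)
}

sum(follow_up_in_year(df$start, df$end, 2015))
sum(follow_up_in_year(df$start, df$end, 2017))
```

With the example data, 2015 sums to 2 (a full year from both A and B), and 2017 sums to 1 + 283/365, because B is followed for 283 days of that year.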

CodePudding user response：

I'm now guessing that you don't actually want to round or truncate anything, so here's a solution that gives output similar to your method (while correcting the 2016 value):

func <- function(st, ed) {
  stopifnot(length(st) == 1, length(ed) == 1)
  stL <- as.POSIXlt(st)
  edL <- as.POSIXlt(ed)
  start_year <- 1900 + stL$year
  end_year <- 1900 + edL$year
  start_eoy <- as.POSIXlt(paste0(start_year, "-12-31"))
  end_eoy <- as.POSIXlt(paste0(end_year, "-12-31"))
  firstyear <- (start_eoy$yday - stL$yday) / start_eoy$yday
  lastyear <- edL$yday / end_eoy$yday
  data.frame(
    year = seq(start_year, end_year),
    n = c(firstyear, rep(1, max(0, end_year - start_year - 1)), lastyear)
  )
}

base R

aggregate(n ~ year, data = do.call(rbind, Map(func, df$start, df$end)), FUN = sum)
#   year         n
# 1 2013 1.0000000
# 2 2014 1.0000000
# 3 2015 2.0000000
# 4 2016 2.0000000
# 5 2017 1.7747253
# 6 2018 1.0000000
# 7 2019 1.0000000
# 8 2020 1.0000000
# 9 2021 0.4450549

dplyr

library(dplyr)
df %>%
  with(Map(func, start, end)) %>%
  bind_rows() %>%
  group_by(year) %>%
  summarize(n = sum(n))
# # A tibble: 9 x 2
#    year     n
#   <int> <dbl>
# 1  2013 1    
# 2  2014 1    
# 3  2015 2    
# 4  2016 2    
# 5  2017 1.77 
# 6  2018 1    
# 7  2019 1    
# 8  2020 1    
# 9  2021 0.445
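Both versions above still call func once per individual through Map. If that becomes the bottleneck, the per-year overlap can also be computed for all individuals at once. The following is a rough base R sketch under my own assumptions, not a drop-in replacement for func: the end date is counted as a followed day, and each year is divided by its actual number of days, so the partial-year values differ slightly from the output above (they match the question's 2017 and 2021 values instead). The names yrs, overlap and result are only illustrative.

```r
df <- data.frame(
  id = c("A", "B"),
  start = as.Date(c("2015-01-01", "2013-01-01")),
  end = as.Date(c("2021-06-12", "2017-10-10"))
)

# All calendar years touched by at least one individual
yrs <- seq(as.integer(format(min(df$start), "%Y")),
           as.integer(format(max(df$end), "%Y")))

# Year boundaries as day numbers; year_end is exclusive (1 Jan of next year)
year_start <- as.numeric(as.Date(paste0(yrs, "-01-01")))
year_end <- as.numeric(as.Date(paste0(yrs + 1, "-01-01")))

s <- as.numeric(df$start)
e <- as.numeric(df$end) + 1  # assumption: the end date is a followed day

# Matrix of overlap days: one row per individual, one column per year.
# Negative values mean no overlap and are floored at 0.
overlap <- pmax(outer(e, year_end, pmin) - outer(s, year_start, pmax), 0)

# Sum over individuals and divide by each year's own length (365 or 366)
result <- data.frame(year = yrs, n = colSums(overlap) / (year_end - year_start))
result
```

For the example data this gives exactly 2 for 2016 and 1 for 2020, 1 + 283/365 for 2017, and 163/365 for 2021. The intermediate matrix has one cell per individual and year, so for very many individuals over a long period it may be worth processing the rows in chunks.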