Home > Blockchain >  In R, use two date columns to calculate the average age of ID's at first event
In R, use two date columns to calculate the average age of ID's at first event

Time:03-12

Background

I've got an R dataframe, d:

d <- data.frame(ID = c("a","a","b","b", "c","c","c"),
                birthdate = as.Date(c("1980-01-01","1980-01-01","2000-12-23","2000-12-23","1949-03-14","1949-03-14","1949-03-14")),
                event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
                stringsAsFactors=FALSE)

It consists of an ID code and two dates: a birthdate and an event_date. Everyone's got a consistent birthdate, but people have multiple events each, all of these occurring on different dates.

The Problem

I'm trying to calculate the average age of people (IDs) in d at their first event. In other words, I'd like to get R to calculate an "age at first event" by subtracting each ID's first event from their birthdate, and then sum them and divide by n (3, in this case).

The answer (if my arithmetic isn't too far off this late at night) should be ~35.3 years old.

What I've tried

I'm not too familiar with date work in R, so I've only gotten so far as mutating a new column that calculates the difference between event_date and birthdate for that row:

d <- d %>% 
  mutate(date_difference = (event_date-birthdate)/365)

But I'm still a ways away from my summary calculation. I'm mainly hung up on how to tell R to find the first date-difference for each ID. (Not to mention that dividing by 365 gives me correct years but they're still labeled "days" in the resulting df.)

CodePudding user response:

You can use this code:

d <- d %>%
  group_by(ID) %>%
  arrange(event_date) %>%
  slice(1) %>%
  mutate(date_difference = as.numeric((event_date-birthdate)/365)) %>%
  ungroup() %>%
  mutate(average_age = mean(date_difference))

Output:

# A tibble: 3 × 5
  ID    birthdate  event_date date_difference average_age
  <chr> <date>     <date>               <dbl>       <dbl>
1 a     1980-01-01 2011-01-01            31.0        35.4
2 b     2000-12-23 2011-12-23            11.0        35.4
3 c     1949-03-14 2013-03-14            64.0        35.4

CodePudding user response:

As an alternative to dividing by 365, you can use the lubridate::time_length function. It computes the length of a period in different time units (seconds, minutes, days, years).

library("tidyverse")

dat <- data.frame(
  ID = c("a", "a", "b", "b", "c", "c", "c"),
  birthdate = as.Date(c("1980-01-01", "1980-01-01", "2000-12-23", "2000-12-23", "1949-03-14", "1949-03-14", "1949-03-14")),
  event_date = as.Date(c("2011-01-01", "2012-08-21", "2011-12-23", "2011-12-31", "2013-03-14", "2013-04-07", "2014-07-14")),
  stringsAsFactors = FALSE
)

dat_with_age <- dat %>%
  group_by(ID) %>%
  slice_min(
    event_date,
    n = 1
  ) %>%
  ungroup() %>%
  mutate(
    age_at_first_event = lubridate::time_length(event_date - birthdate, unit = "year")
  )
dat_with_age
#> # A tibble: 3 × 4
#>   ID    birthdate  event_date age_at_first_event
#>   <chr> <date>     <date>                  <dbl>
#> 1 a     1980-01-01 2011-01-01               31.0
#> 2 b     2000-12-23 2011-12-23               11.0
#> 3 c     1949-03-14 2013-03-14               64

dat_with_age %>%
  summarise(
    mean(age_at_first_event)
  )
#> # A tibble: 1 × 1
#>   `mean(age_at_first_event)`
#>                        <dbl>
#> 1                       35.3

Created on 2022-03-11 by the reprex package (v2.0.1)

  • Related