I have a data frame like this:
ID <- c("A", "B", "C", "D")
birthday <- c(12, 23, 2, 20)
birthmonth <- c(8, 10, 3, 9)
birthyear <- c(79, 62, 66, 83)
mydf <- data.frame(ID, birthday, birthmonth, birthyear)
mydf
  ID birthday birthmonth birthyear
1  A       12          8        79
2  B       23         10        62
3  C        2          3        66
4  D       20          9        83
As you can see, birth years are given as two digits, and the day, month, and year are stored in separate columns. With a data frame like this, how can I calculate the mean age of my sample?
Thank you so much!
CodePudding user response:
We could use lubridate's make_date() to turn the individual columns into a date column and then calculate the age. I have shown here how you could take care of the missing 19/20 in birthyear, but you might need to tweak it for your data.
library(dplyr)
library(lubridate)
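mydf |>
  mutate(
    # expand two-digit years: above 21 -> 19xx, otherwise 20xx
    date = make_date(if_else(birthyear > 21, birthyear + 1900, birthyear + 2000), birthmonth, birthday),
    # completed years between the birth date and today
    age = as.period(interval(date, today()))$year
  )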
Output:
  ID birthday birthmonth birthyear       date age
1  A       12          8        79 1979-08-12  43
2  B       23         10        62 1962-10-23  59
3  C        2          3        66 1966-03-02  56
4  D       20          9        83 1983-09-20  38
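The cutoff of 21 used above is the part most likely to need tweaking. As a rough sketch, the century logic could be pulled into a small helper so the cutoff becomes a parameter. Note that expand_year and its pivot argument are hypothetical names introduced here for illustration, not part of lubridate or dplyr, and the helper assumes every year at or below the cutoff belongs to the 2000s:

```r
# Hypothetical helper: expand two-digit years around a cutoff (pivot).
# Years above the pivot are treated as 19xx, the rest as 20xx.
expand_year <- function(yy, pivot = 21) {
  ifelse(yy > pivot, 1900 + yy, 2000 + yy)
}

expand_year(c(79, 62, 66, 83, 5))
# 1979 1962 1966 1983 2005
```

If your sample could include people born before 1922, a single cutoff like this cannot tell them apart from people born in the 2000s, so you would need extra information.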
And to get the mean age with summarise:
mydf |>
  mutate(
    date = make_date(if_else(birthyear > 21, birthyear + 1900, birthyear + 2000), birthmonth, birthday),
    age = as.period(interval(date, today()))$year
  ) |>
  summarise(mean_age = mean(age))
Output:
  mean_age
1       49
Update: Getting an age calculation that is both correct and fast can be non-trivial; see e.g. "Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date".
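For reference, here is a minimal base R sketch of one way to count completed years against an arbitrary date, without lubridate. The function name age_in_years is hypothetical, and the reference date 2022-09-01 is an assumption chosen only so that the result is reproducible; it is not necessarily the approach used in the linked question:

```r
# Hypothetical helper: completed years between a birth date and a reference date.
age_in_years <- function(birth, ref = Sys.Date()) {
  birth <- as.POSIXlt(birth)
  ref <- as.POSIXlt(ref)
  # difference in calendar years
  age <- ref$year - birth$year
  # subtract one where the birthday has not yet occurred in the reference year
  not_yet <- (ref$mon < birth$mon) |
    (ref$mon == birth$mon & ref$mday < birth$mday)
  age - as.integer(not_yet)
}

# Same sample as in the question, with years already expanded to four digits
birth <- as.Date(c("1979-08-12", "1962-10-23", "1966-03-02", "1983-09-20"))
ref_date <- as.Date("2022-09-01")  # assumed fixed date, for reproducibility

ages <- age_in_years(birth, ref_date)
ages        # 43 59 56 38
mean(ages)  # 49
```

Because it compares month and day directly instead of dividing a day count by 365.25, this avoids off-by-one errors around birthdays. Someone born on 29 February is counted as turning a year older on 1 March in non-leap years.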