I have a dataset in R with people's complete ages as strings (e.g., "10 years 8 months 23 days"), and I need to transform it into a numeric variable that makes sense. I thought about converting it to the person's age in days, but that is hard because months have different numbers of days. So the best solution might be a double variable that shows age as something like 10.6 or 10.8: a numeric variable that preserves the fact that 10 years 8 months 5 days is greater than 10 years 7 months 12 days.
Here is an example of the current variable I have:
library(tibble)
age <- tibble(complete_age =
c("10 years 8 months 23 days",
"9 years 11 months 7 days",
"11 years 3 months 1 day",
"8 years 6 months 12 days"))
age
# A tibble: 4 x 1
complete_age
<chr>
1 10 years 8 months 23 days
2 9 years 11 months 7 days
3 11 years 3 months 1 day
4 8 years 6 months 12 days
Here is an example of the outcome I would love to see (with approximate values for age_num):
> age
# A tibble: 4 x 2
complete_age age_num
<chr> <dbl>
1 10 years 8 months 23 days 10.66
2 9 years 11 months 7 days 9.92
3 11 years 3 months 1 day 11.27
4 8 years 6 months 12 days 8.52
In summary, I have a dataset with the "complete_age" column, and I want to create the column "age_num".
How can I do that in R? I'm having a hard time trying to use stringr and lubridate, but maybe this is the way to go?
CodePudding user response:
Split on spaces, then compute. Note that you may want to adjust the average number of days per month (30.45) and per year (365.25) as needed:
library(dplyr)
library(tidyr)
age %>%
  # keep the three numbers; the NA entries drop the unit words
  separate(complete_age, into = c("Y", NA, "M", NA, "D", NA),
           convert = TRUE, remove = FALSE) %>%
  transmute(complete_age, age_num = Y + (M * 30.45 + D) / 365.25)
# # A tibble: 4 x 2
# complete_age age_num
# <chr> <dbl>
# 1 10 years 8 months 23 days 10.7
# 2 9 years 11 months 7 days 9.94
# 3 11 years 3 months 1 day 11.3
# 4 8 years 6 months 12 days 8.53
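For comparison, the same arithmetic can be sketched in base R without any packages. The helper name age_to_years and its two parameters are hypothetical (they are not part of the answer above), and the sketch assumes every string contains exactly three numbers in year, month, day order:

```r
# Hypothetical helper: convert "Y years M months D days" to fractional years,
# using average month and year lengths (both adjustable).
age_to_years <- function(x, days_per_month = 30.45, days_per_year = 365.25) {
  # Pull every run of digits out of each string
  parts <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(parts, function(p) {
    p <- as.numeric(p)
    stopifnot(length(p) == 3)  # assumption: year, month and day are all present
    p[1] + (p[2] * days_per_month + p[3]) / days_per_year
  }, numeric(1))
}

complete_age <- c("10 years 8 months 23 days",
                  "9 years 11 months 7 days",
                  "11 years 3 months 1 day",
                  "8 years 6 months 12 days")
age_num <- age_to_years(complete_age)
round(age_num, 2)
# [1] 10.73  9.94 11.25  8.53
```

For the first row this is 10 + (8 * 30.45 + 23) / 365.25, i.e. 10 + 266.6 / 365.25, which is about 10.73.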
CodePudding user response:
Here is an alternative approach:
- Remove all alphabetic characters ('[A-Za-z]') with str_remove_all
- Separate the resulting numbers
- Apply the calculation after switching to numeric with type.convert(as.is = TRUE)
- Rebind to the original column with bind_cols
library(dplyr)
library(tidyr)
library(stringr)
age %>%
  mutate(complete_age = str_remove_all(complete_age, "[A-Za-z]")) %>%
  # extra = "drop" discards any empty trailing piece left by the removed letters
  separate(complete_age, c("year", "month", "day"), extra = "drop") %>%
  type.convert(as.is = TRUE) %>%
  mutate(ageYear = (year + month/12 + day/365), .keep = "unused") %>%
  bind_cols(age)
ageYear complete_age
<dbl> <chr>
1 10.7 10 years 8 months 23 days
2 9.94 9 years 11 months 7 days
3 11.3 11 years 3 months 1 day
4 8.53 8 years 6 months 12 days
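As a variation on the steps above, the three numbers can also be captured directly with a single regular expression instead of removing the letters first. This is only a sketch: the pattern and the variable names are assumptions, and it expects the exact "N year(s) N month(s) N day(s)" layout shown in the question.

```r
library(stringr)

complete_age <- c("10 years 8 months 23 days",
                  "9 years 11 months 7 days",
                  "11 years 3 months 1 day",
                  "8 years 6 months 12 days")

# "s?" makes the plural optional, so "1 day" matches as well as "23 days"
pattern <- "(\\d+) years? (\\d+) months? (\\d+) days?"

# str_match returns a character matrix: column 1 is the full match,
# columns 2 to 4 are the captured year, month and day
m <- str_match(complete_age, pattern)
nums <- matrix(as.numeric(m[, 2:4]), ncol = 3)

# Same formula as in the answer above
ageYear <- nums[, 1] + nums[, 2] / 12 + nums[, 3] / 365
round(ageYear, 2)
```

Strings that do not fit the pattern produce NA rather than an error, which makes malformed rows easy to spot.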
CodePudding user response:
Using the lubridate convenience functions period and time_length:
library(dplyr)
library(lubridate)
age %>%
  mutate(age_years = time_length(period(complete_age), unit = "years"))
# A tibble: 4 x 2
# complete_age age_years
# <chr> <dbl>
# 1 10 years 8 months 23 days 10.729637
# 2 9 years 11 months 7 days 9.935832
# 3 11 years 3 months 1 day 11.252738
# 4 8 years 6 months 12 days 8.532854
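Since the question also mentions expressing age in days, the same two functions can be asked for a different unit. This is a sketch under the assumption that time_length accepts unit = "days" for a period, as it does for "years" above; because a period has no start date, lubridate has to rely on average month and year lengths, so the result is an approximation rather than an exact day count.

```r
library(lubridate)

complete_age <- c("10 years 8 months 23 days",
                  "9 years 11 months 7 days",
                  "11 years 3 months 1 day",
                  "8 years 6 months 12 days")

# Parse the strings into periods, as in the answer above
p <- period(complete_age)

# Approximate age in days and in years
age_days  <- time_length(p, unit = "days")
age_years <- time_length(p, unit = "years")

# Both versions rank the four people in the same order
order(age_days)
order(age_years)
```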