Transforming complete age from character to numeric in R

Time:12-02

I have a dataset in R with people's complete age stored as strings (e.g., "10 years 8 months 23 days"), and I need to transform it into a numeric variable that makes sense. I considered converting it to age in days, but that is hard because months have different numbers of days. So the best solution might be a double variable that shows age as, say, 10.6 or 10.8: some numeric variable that preserves the fact that "10 years 8 months 5 days" is greater than "10 years 7 months 12 days".

Here is an example of the variable I currently have:

library(tibble)

age <- tibble(complete_age = 
             c("10 years 8 months 23 days",
               "9 years 11 months 7 days",
               "11 years 3 months 1 day",
               "8 years 6 months 12 days")) 

age

# A tibble: 4 x 1
  complete_age             
  <chr>                    
1 10 years 8 months 23 days
2 9 years 11 months 7 days 
3 11 years 3 months 1 day  
4 8 years 6 months 12 days 

Here is an example of the outcome I would love to see (with approximate values for age_num):

> age
# A tibble: 4 x 2
  complete_age              age_num
  <chr>                       <dbl>
1 10 years 8 months 23 days    10.66
2 9 years 11 months 7 days      9.92
3 11 years 3 months 1 day      11.27
4 8 years 6 months 12 days      8.52

In summary, I have a dataset with the "complete_age" column, and I want to create the "age_num" column.

How can I do that in R? I've been having a hard time with stringr and lubridate, but maybe that is the way to go?

CodePudding user response:

Split on spaces, then compute. Note that you may want to adjust the average number of days per month (30.45 here) and per year (365.25 here) as needed:

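library(dplyr)
library(tidyr)

age %>% 
  # keep the numbers as Y, M, D; the NA entries drop the unit words
  separate(complete_age, into = c("Y", NA, "M", NA, "D", NA), 
           convert = TRUE, remove = FALSE) %>% 
  transmute(complete_age, age_num = Y + (M * 30.45 + D) / 365.25)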

# # A tibble: 4 x 2
#   complete_age                 age_num
#   <chr>                          <dbl>
# 1 10 years 8 months 23 days      10.7 
# 2 9 years 11 months 7 days        9.94
# 3 11 years 3 months 1 day        11.3 
# 4 8 years 6 months 12 days        8.53
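
To make the "change the averages as needed" advice concrete, here is a minimal base R sketch that exposes the two constants as arguments. The function name age_to_years and its argument names are hypothetical (they do not come from any package), and the sketch assumes every string contains exactly three numbers in year, month, day order.

```r
# Hypothetical helper, not from any package.
# Assumes each string holds exactly three numbers: years, months, days.
age_to_years <- function(x, days_per_month = 30.45, days_per_year = 365.25) {
  # Extract every run of digits from each string, in order of appearance.
  parts <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(parts, function(p) {
    p <- as.numeric(p)
    stopifnot(length(p) == 3)
    # Same formula as the answer above, with adjustable constants.
    p[1] + (p[2] * days_per_month + p[3]) / days_per_year
  }, numeric(1))
}

complete_age <- c("10 years 8 months 23 days", "9 years 11 months 7 days")
print(age_to_years(complete_age))
# A different convention, e.g. 30-day months and 360-day years:
print(age_to_years(complete_age, days_per_month = 30, days_per_year = 360))
```

Whatever constants are chosen, the result keeps the ordering the question asks for as long as the day counts stay within a month.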

CodePudding user response:

Here is an alternative approach:

  1. Remove all alphabetic characters '[A-Za-z]' with str_remove_all
  2. separate the resulting numbers
  3. apply the calculation after switching to numeric with type.convert(as.is = TRUE)
  4. rebind to the original column with bind_cols

library(dplyr)
library(tidyr)
library(stringr)

age %>% 
  mutate(complete_age = str_remove_all(complete_age, "[A-Za-z]")) %>% 
  # extra = "drop" silently discards any leftover piece (e.g. from trailing whitespace)
  separate(complete_age, c("year", "month", "day"), extra = "drop") %>% 
  type.convert(as.is = TRUE) %>% 
  mutate(ageYear = (year + month/12 + day/365), .keep="unused") %>% 
  bind_cols(age)

  ageYear complete_age             
    <dbl> <chr>                    
1   10.7  10 years 8 months 23 days
2    9.94 9 years 11 months 7 days 
3   11.3  11 years 3 months 1 day  
4    8.53 8 years 6 months 12 days 
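
Both answers so far rely on the numbers appearing in a fixed position. As a hedged sketch only, assuming some records might omit a unit (the question does not say they do), the numbers can instead be looked up by the unit word that follows them. The helper names unit_value and age_in_years are hypothetical, and the formula reuses the year + month/12 + day/365 convention from the answer above.

```r
# Hypothetical helpers; not part of stringr, tidyr, or lubridate.
unit_value <- function(x, unit) {
  # Capture the digits that directly precede " <unit>" (e.g. " month" or " months").
  pattern <- paste0(".*?([0-9]+) ", unit, ".*")
  found <- grepl(pattern, x, perl = TRUE)
  out <- numeric(length(x))  # a missing unit counts as 0
  out[found] <- as.numeric(sub(pattern, "\\1", x[found], perl = TRUE))
  out
}

age_in_years <- function(x) {
  unit_value(x, "year") + unit_value(x, "month") / 12 + unit_value(x, "day") / 365
}

# The second string has no months component (an assumed edge case).
print(age_in_years(c("11 years 3 months 1 day", "10 years 23 days")))
```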

CodePudding user response:

Using lubridate convenience functions, period and time_length:

library(dplyr)
library(lubridate)

age %>% 
  # period() parses the string; time_length() expresses it in years
  mutate(age_years = time_length(period(complete_age), unit = "years"))

  # A tibble: 4 x 2
  #  complete_age              age_years
  # <chr>                         <dbl>
  # 1 10 years 8 months 23 days 10.729637
  # 2  9 years 11 months 7 days  9.935832
  # 3   11 years 3 months 1 day 11.252738
  # 4  8 years 6 months 12 days  8.532854
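
The values printed above appear consistent with counting a year as 365.25 days and a month as one twelfth of a year. The base R sketch below reproduces them by hand under that assumption; it is an arithmetic check on this example, not a description of lubridate's internals.

```r
# Assumption: 1 year = 365.25 days, 1 month = 1/12 year.
years  <- c(10, 9, 11, 8)
months <- c(8, 11, 3, 6)
days   <- c(23, 7, 1, 12)

age_years <- years + months / 12 + days / 365.25
print(round(age_years, 6))
```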