Home > database >  R: Convert column with mixed 5-digits and character to date
R: Convert column with mixed 5-digits and character to date

Time:06-18

I read in an excel file which contains a column date of death that contains both date and strings. It looked fine in the excel, but when it is read into R, all of the dates got converted into 5 digits - looking something like this:

#>  date of death  
#>   <chr>           
#> 1 44673         
#> 2 44674         
#> 3 alive
#> 4 not known
#> 5 NA

Is there a way where

  1. I could read in the excel file in a way that it doesn't get converted to the 5 digits?
  2. If 1 is not possible, is there a way to convert that column into dates but for the numbers only?

CodePudding user response:

If we want to keep the string as is, then we may convert to Date class on the numeric elements after reading and then coalesce with the original column. Note that column can have only a single type, so we may need to return as character type as there are strings

library(dplyr)
df1 <- df1 %>% 
    mutate(`date of death` = coalesce(as.character(janitor::excel_numeric_to_date(
          as.numeric(`date of death`))), `date of death`))

-output

df1
# A tibble: 6 × 2
  `date of death`  col2
  <chr>           <dbl>
1 2022-04-22          5
2 2022-04-23          3
3 alive               9
4 not known          10
5 NA                 11

Tried with read.xlsx (from openxlsx) and read_excel (from readxl). Based on the example created, read_excel does convert to Date class when specified the col_types, but that will also result in NA for the other string values in the same column. However, read.xlsx with detectDates does the conversion

library(openxlsx)
df1 <- read.xlsx(file.choose(),  detectDates = TRUE)
df1
  date.of.death col2
1 2022-04-22      5
2 2022-04-23      3
3      alive      9
4   not known    10
5        <NA>    11

> sapply(df1, class)
date.of.death          col2 
  "character"     "numeric" 

To keep the spaces in the column names, we may need to specify the check.names and the sep.names

df1 <- read.xlsx(file.choose(),  detectDates = TRUE,
      check.names = FALSE, sep.names = " ")
  • Related