Formatting date column with different formats (including missing day information)

I'm relatively new to R. I downloaded a clinical trial dataset and noticed that the dates in the relevant column use mixed formats: most of them look like "September 1, 2012", but some are missing the day (e.g. "October 2015").

I want to express them all in the same format (e.g. yyyy-mm-dd) so that I can work with them. That part works; the only remaining problem is the name of the output column. In the last function (date_correction) I planned to include an argument "output_col" through which I can pass the intended name of the new (formatted) column, but the column is always named output_col instead.

How can I pass the intended name of the output column into the function?
Is there a better way to solve my problem? I also tried passing a more complex orders argument to lubridate::parse_date_time, like

parse_date_time(input_col, orders="mdy", "my")

but this didn't work.

Here's the code:

library("tidyverse")
library("lubridate")

Observation <- c(seq(1:5))
Date_original <- c("October 2014","August 2014","June 2013",
                   "June 24, 2010","January 2005")

df_dates <- data.frame(Observation, Date_original)

# looking for a comma in the cell
comma_detect <- function(a_string){
  str_detect(a_string, ",")
}

# if comma: assume "mdy", if not apply "my" -> return formatted value
date_correction_row <- function(input_col){
  if_else(comma_detect(input_col),
          parse_date_time(input_col, orders="mdy"),
          parse_date_time(input_col, orders="my"))
}

# prepare function for dataframe:
date_correction <- function(df, input_col, output_col){
  mutate(df, output_col = date_correction_row(input_col))
}

df_dates %>% date_correction(df_dates$Date_original, date_formatted) %>% view()

OUTPUT

  Observation Date_original output_col
1           1  October 2014 2014-10-01
2           2   August 2014 2014-08-01
3           3     June 2013 2013-06-01
4           4 June 24, 2010 2010-06-24
5           5  January 2005 2005-01-01

Answer:

Try each format and take the one that does not give NA.

output_col <- "Date"

within(df_dates, assign(output_col, pmin(na.rm = TRUE,
 as.Date(Date_original, "%B %d, %Y"), 
 as.Date(paste(Date_original, 1), "%B %Y %d"))))
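To tie this back to the question about naming the output column, the same two-format idea can be wrapped in a function that takes the column names as strings. This is only a sketch: add_parsed_date and its arguments are hypothetical names, and %B assumes an English locale for the month names.

```r
# Hypothetical helper (not from the original answer): parses a column that
# mixes "Month day, year" and "Month year" strings into Date values.
# Assumes an English locale, since %B matches locale-specific month names.
add_parsed_date <- function(df, input_col, output_col) {
  x <- as.character(df[[input_col]])
  # First attempt: full dates such as "June 24, 2010"; other entries become NA.
  parsed <- as.Date(x, format = "%B %d, %Y")
  # Fallback: append a day of 1 to entries such as "October 2014".
  fallback <- as.Date(paste(x, 1), format = "%B %Y %d")
  is_missing <- is.na(parsed)
  parsed[is_missing] <- fallback[is_missing]
  # `[[<-` with a string creates a column named by the value of output_col.
  df[[output_col]] <- parsed
  df
}

df_example <- data.frame(Date_original = c("October 2014", "June 24, 2010"))
add_parsed_date(df_example, "Date_original", "date_formatted")
```

Because the column is assigned with df[[output_col]], the new column gets whatever name is passed in, rather than the literal name output_col.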

This can also be done with lubridate, which is more compact. However, it is a bit more fragile: the code works as written, but if the arguments to coalesce were swapped it would no longer give correct results, and that might not be immediately obvious.

library(dplyr)
library(lubridate)

output_col <- "Date"

df_dates %>% 
  mutate(!!output_col := coalesce(myd(paste(Date_original, 1), quiet = TRUE), 
    mdy(Date_original)))
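Since the question asks for a function, the pipeline above could be wrapped roughly as follows. This is a sketch rather than a definitive implementation: add_date_column is a hypothetical name, and both column names are assumed to be passed as strings, with the input column looked up through dplyr's .data pronoun.

```r
library(dplyr)
library(lubridate)

# Hypothetical wrapper around the coalesce approach above.
# input_col and output_col are both character strings.
add_date_column <- function(df, input_col, output_col) {
  df %>%
    mutate(!!output_col := coalesce(
      # Month-year entries first, with a day of 1 appended.
      myd(paste(.data[[input_col]], 1), quiet = TRUE),
      # Full "Month day, year" entries second; the order matters.
      mdy(.data[[input_col]], quiet = TRUE)
    ))
}

df_example <- data.frame(Date_original = c("October 2014", "June 24, 2010"))
result <- add_date_column(df_example, "Date_original", "date_formatted")
names(result)
```

The `!!output_col :=` form is what makes mutate use the value of output_col as the column name; a plain `output_col = ...` always creates a column literally called output_col.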

Answer:

When the date structure is known, I like to explicitly correct the date structure first, then parse. Here I use a regex to insert a day of 1 wherever the day is missing, and then parse as usual.

library(tidyverse)
df_dates %>% 
  mutate(
    output_col = gsub("(?<!,)\\s(?=\\d{4})", " 1, ", Date_original, perl = TRUE) %>% 
    as.Date(., format = '%B %d, %Y')
  )

  Observation Date_original output_col
1           1  October 2014 2014-10-01
2           2   August 2014 2014-08-01
3           3     June 2013 2013-06-01
4           4 June 24, 2010 2010-06-24
5           5  January 2005 2005-01-01
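To see what the regex does on its own, it can help to look at the intermediate strings before parsing. The sketch below uses a small made-up vector x (not the question's data frame) and base R only; the %B format again assumes an English locale.

```r
# Illustrative input; x is a hypothetical vector, not from the original post.
x <- c("October 2014", "June 24, 2010")

# Match a whitespace character that is followed by a four-digit year and is
# not preceded by a comma, i.e. the gap in "Month year" entries only.
normalized <- gsub("(?<!,)\\s(?=\\d{4})", " 1, ", x, perl = TRUE)
normalized
# "October 1, 2014" "June 24, 2010"

# Every string now has the same shape, so a single format is enough.
dates <- as.Date(normalized, format = "%B %d, %Y")
```

Entries that already contain a day are left untouched, because the only space before their year follows a comma and is excluded by the lookbehind.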