Parse date column with mix of date formats-CodePudding

I have to combine several spreadsheets for my project and for some reason they all decided to enter the dates differently. Below is a sample of my data with all of the different date formats I got after reading in and combining all of the spreadsheets.

date
13-MAR-18
2018-08-05
43423
11-Mar-2019
10/16/2018

I'm trying to standardize everything and turn it into a yyyy-mm-dd format. This is my attempt below.

library(lubridate)
parse_date_time(x = df$date,
                orders = c("Y m d, "d m y", "d B Y", "m/d/y", ),
                locale = "eng")

The problem is that the moment I reach a row with a different format, it stops working and just gives me NA's for the rest of the rows. How do I fix this?

CodePudding user response：

In my point of view there are 3 challenges.

First of all it is difficult to feed the orders argument of parse_date_time. I think after applying arrange we could get some control.

Second, the integer type of date must be handled separately I think.

Finally, it is not clear which is month in "2018-08-05" it could be 8 or 5.

library(lubridate)
library(dplyr)
df %>% 
  arrange(date) %>% 
  mutate(x = parse_date_time(date, orders=c("mdy", "dmy", "ymd")),
         y = as.integer(str_extract(date, '^\\d $')),
         y = as.Date(y, origin = "1899-12-30"),
         x = coalesce(x,y)) %>% 
  select(date, new_date=x)

 date   new_date
1  10/16/2018 2018-10-16
2 11-Mar-2019 2019-11-20
3   13-MAR-18 2018-03-13
4  2018-08-05 2018-08-05
5       43423 2018-11-19
Warning message:
Problem while computing `x = parse_date_time(date, orders =
c("mdy", "dmy", "ymd"))`.
i  1 failed to parse.

CodePudding user response：

Edit:

I've just realised that the conversion for "43423" to "2022-07-27 16:02:42" is incorrect using this approach (2022-07-27 is today's date), as further explained in the docs: https://github.com/gaborcsardi/parsedate

For this solution to work, you would need to handle 'integer' date formats first, then parsedate() the remaining dates. I think @TarJae's solution is the way to go.

Original answer:

Here is a potential solution using the parsedate package:

#install.packages("parsedate")
library(parsedate)

df <- read.table(text = "date
13-MAR-18
2018-08-05
43423
11-Mar-2019
10/16/2018", header = TRUE)

df$date_parsed <- parse_date(df$date)
df
#>          date         date_parsed
#> 1   13-MAR-18 2018-03-13 00:00:00
#> 2  2018-08-05 2018-08-05 00:00:00
#> 3       43423 2022-07-27 16:02:42
#> 4 11-Mar-2019 2019-03-11 00:00:00
#> 5  10/16/2018 2018-10-16 00:00:00

^{Created on 2022-07-27 by the reprex package (v2.0.1)}

CodePudding user response：

You need

as.Date(x, tryFormats = c()

See:

Use as.Date with tryFormats to parse dates with different formats

You need to specify the formats in c() in the best order to run it.

CodePudding user response：

Here is an answer without using any package for reference.

dateV=c(
'13-MAR-18',
'2018-08-05',
'13-Jun-19')

#Note: "pat and formatUsed" is a lookup in pair
pat=c("[0-9]{2}-\\w\\w\\w-[0-9]{2}$", "[0-9]{4}-[0-9]{2}-[0-9]{2}")
formatUsed=c("%d-%B-%y", "%Y-%m-%d")

betterDateV=NULL

for (tryP in 1:length(pat))
{
  dx=which(grepl(pat[tryP], dateV))
  betterDateV[dx] = as.Date( dateV[dx]   , format= formatUsed[tryP] )
}
as.Date(as.numeric(betterDateV))