Home > Software design >  Parse date column with mix of date formats
Parse date column with mix of date formats

Time:07-27

I have to combine several spreadsheets for my project and for some reason they all decided to enter the dates differently. Below is a sample of my data with all of the different date formats I got after reading in and combining all of the spreadsheets.

date
13-MAR-18
2018-08-05
43423
11-Mar-2019
10/16/2018

I'm trying to standardize everything and turn it into a yyyy-mm-dd format. This is my attempt below.

library(lubridate)
parse_date_time(x = df$date,
                orders = c("Y m d, "d m y", "d B Y", "m/d/y", ),
                locale = "eng")

The problem is that the moment I reach a row with a different format, it stops working and just gives me NA's for the rest of the rows. How do I fix this?

CodePudding user response:

In my point of view there are 3 challenges.

First of all it is difficult to feed the orders argument of parse_date_time. I think after applying arrange we could get some control.

Second, the integer type of date must be handled separately I think.

Finally, it is not clear which is month in "2018-08-05" it could be 8 or 5.

library(lubridate)
library(dplyr)
df %>% 
  arrange(date) %>% 
  mutate(x = parse_date_time(date, orders=c("mdy", "dmy", "ymd")),
         y = as.integer(str_extract(date, '^\\d $')),
         y = as.Date(y, origin = "1899-12-30"),
         x = coalesce(x,y)) %>% 
  select(date, new_date=x)
 date   new_date
1  10/16/2018 2018-10-16
2 11-Mar-2019 2019-11-20
3   13-MAR-18 2018-03-13
4  2018-08-05 2018-08-05
5       43423 2018-11-19
Warning message:
Problem while computing `x = parse_date_time(date, orders =
c("mdy", "dmy", "ymd"))`.
i  1 failed to parse. 

CodePudding user response:

Edit:

I've just realised that the conversion for "43423" to "2022-07-27 16:02:42" is incorrect using this approach (2022-07-27 is today's date), as further explained in the docs: https://github.com/gaborcsardi/parsedate

For this solution to work, you would need to handle 'integer' date formats first, then parsedate() the remaining dates. I think @TarJae's solution is the way to go.

Original answer:

Here is a potential solution using the parsedate package:

#install.packages("parsedate")
library(parsedate)

df <- read.table(text = "date
13-MAR-18
2018-08-05
43423
11-Mar-2019
10/16/2018", header = TRUE)

df$date_parsed <- parse_date(df$date)
df
#>          date         date_parsed
#> 1   13-MAR-18 2018-03-13 00:00:00
#> 2  2018-08-05 2018-08-05 00:00:00
#> 3       43423 2022-07-27 16:02:42
#> 4 11-Mar-2019 2019-03-11 00:00:00
#> 5  10/16/2018 2018-10-16 00:00:00

Created on 2022-07-27 by the reprex package (v2.0.1)

CodePudding user response:

You need

as.Date(x, tryFormats = c()

See:

Use as.Date with tryFormats to parse dates with different formats

You need to specify the formats in c() in the best order to run it.

CodePudding user response:

Here is an answer without using any package for reference.

dateV=c(
'13-MAR-18',
'2018-08-05',
'13-Jun-19')

#Note: "pat and formatUsed" is a lookup in pair
pat=c("[0-9]{2}-\\w\\w\\w-[0-9]{2}$", "[0-9]{4}-[0-9]{2}-[0-9]{2}")
formatUsed=c("%d-%B-%y", "%Y-%m-%d")

betterDateV=NULL

for (tryP in 1:length(pat))
{
  dx=which(grepl(pat[tryP], dateV))
  betterDateV[dx] = as.Date( dateV[dx]   , format= formatUsed[tryP] )
}
as.Date(as.numeric(betterDateV))
  • Related