Home > Blockchain >  Convert various different formats to yyyy-mm-dd
Convert various different formats to yyyy-mm-dd

Time:01-18

I have a large vector of dates that has inconsistencies in date formats. Here is a sample of this vector

dates = c("2018-11-06", "11/6/2018", "11-6-2018")

I want to convert this vector to a consistent format of "2018-11-06".

I have tried the following solutions but they are very slow -

  anytime::anydate(dates)
  as.Date(lubridate::parse_date_time(dates, '%m-%d-%Y')) # Partially correct results

Please suggest a solution that is fast (since I have a vector with millions of dates) as well as correct.

Following is the performance benchmark -

microbenchmark::microbenchmark(
  user2974951 = substr(parse_date_time(dates, c("%Y-%m-%d","%d/%m/%Y","%d/%m/%Y")), 1, 10),
  user2974951_2 = substr(fast_strptime(dates,c("%Y-%m-%d","%d/%m/%Y","%d-%m-%Y")), 1, 10),
  Wimpel = {
    dates_1 = gsub("/", "-", dates)
    dates_2 = ifelse(grepl("^[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$", dates_1),
                     gsub("^([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})$", "\\3-\\1-\\2", dates_1), 
                     dates_1)
    fasttime::fastDate(dates_2)
  },
  
  TarJae = substr(parse_date_time(dates, orders=c("ymd", "mdy")), 1, 10),
  times = 10)

The solution by Wimpel is the fastest. The solution by user2974951_2 is easy to understand and quite fast, hence accepting it as the answer.

Unit: microseconds
          expr      min       lq      mean    median       uq       max neval cld
 user2974951_1 7070.939 7147.098 7385.8261 7238.3540 7520.453  8193.161    10   b
 user2974951_2  319.894  355.736  404.0914  372.4270  415.670   707.338    10  a 
        Wimpel  112.448  130.324  148.4439  152.1875  159.463   190.922    10  a 
        TarJae 6123.001 6140.183 6959.6301 6277.1425 6838.099 12173.134    10   b

Many people have voted to close this question. Please do not close this question as the solution given here Why are my functions on lubridate dates so slow? has a solution only when the date format is consistent. In my question, the dates format is inconsistent.

CodePudding user response:

I believe this should work pretty fast:

create a uniform date-column, and the use the fasttime-package to convert

dates = c("2018-11-06", "11/6/2018", "11-6-2018")

library(data.table)
library(fasttime)
DT <- as.data.table(dates)
# replace all "/" charachters to "-"
DT[, dates2 := gsub("/", "-", dates)][]
# reverse strings startting with 4 digits
DT[grepl("^[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$", dates2), 
   dates2 := gsub("^([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})$", "\\3-\\1-\\2", dates2)][]
# usse fastposix to convert
DT[, dates3 := fasttime::fastDate(dates2)][]
#         dates     dates2     dates3
# 1: 2018-11-06 2018-11-06 2018-11-06
# 2:  11/6/2018  2018-11-6 2018-11-06
# 3:  11-6-2018  2018-11-6 2018-11-06

CodePudding user response:

Try

> lubridate::parse_date_time(dates,c("%Y-%m-%d","%d/%m/%Y","%d-%m-%Y"))
[1] "2018-11-06 UTC" "2018-06-11 UTC" "2018-06-11 UTC"

you can try as.character(...) to remove the time zone, if you do not need it to be in date format.

Another option is

fast_strptime(dates,c("%Y-%m-%d","%d/%m/%Y","%d-%m-%Y"))
  •  Tags:  
  • r
  • Related