I have a large vector of dates that has inconsistencies in date formats. Here is a sample of this vector
dates = c("2018-11-06", "11/6/2018", "11-6-2018")
I want to convert this vector to a consistent format of "2018-11-06".
I have tried the following solutions but they are very slow -
anytime::anydate(dates)
as.Date(lubridate::parse_date_time(dates, '%m-%d-%Y')) # Partially correct results
Please suggest a solution that is fast (since I have a vector with millions of dates) as well as correct.
Following is the performance benchmark -
microbenchmark::microbenchmark(
user2974951 = substr(parse_date_time(dates, c("%Y-%m-%d","%d/%m/%Y","%d/%m/%Y")), 1, 10),
user2974951_2 = substr(fast_strptime(dates,c("%Y-%m-%d","%d/%m/%Y","%d-%m-%Y")), 1, 10),
Wimpel = {
dates_1 = gsub("/", "-", dates)
dates_2 = ifelse(grepl("^[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$", dates_1),
gsub("^([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})$", "\\3-\\1-\\2", dates_1),
dates_1)
fasttime::fastDate(dates_2)
},
TarJae = substr(parse_date_time(dates, orders=c("ymd", "mdy")), 1, 10),
times = 10)
The solution by Wimpel is the fastest. The solution by user2974951_2 is easy to understand and quite fast, hence accepting it as the answer.
Unit: microseconds
expr min lq mean median uq max neval cld
user2974951_1 7070.939 7147.098 7385.8261 7238.3540 7520.453 8193.161 10 b
user2974951_2 319.894 355.736 404.0914 372.4270 415.670 707.338 10 a
Wimpel 112.448 130.324 148.4439 152.1875 159.463 190.922 10 a
TarJae 6123.001 6140.183 6959.6301 6277.1425 6838.099 12173.134 10 b
Many people have voted to close this question. Please do not close this question as the solution given here Why are my functions on lubridate dates so slow? has a solution only when the date format is consistent. In my question, the dates format is inconsistent.
CodePudding user response:
I believe this should work pretty fast:
create a uniform date-column, and the use the fasttime-package to convert
dates = c("2018-11-06", "11/6/2018", "11-6-2018")
library(data.table)
library(fasttime)
DT <- as.data.table(dates)
# replace all "/" charachters to "-"
DT[, dates2 := gsub("/", "-", dates)][]
# reverse strings startting with 4 digits
DT[grepl("^[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$", dates2),
dates2 := gsub("^([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})$", "\\3-\\1-\\2", dates2)][]
# usse fastposix to convert
DT[, dates3 := fasttime::fastDate(dates2)][]
# dates dates2 dates3
# 1: 2018-11-06 2018-11-06 2018-11-06
# 2: 11/6/2018 2018-11-6 2018-11-06
# 3: 11-6-2018 2018-11-6 2018-11-06
CodePudding user response:
Try
> lubridate::parse_date_time(dates,c("%Y-%m-%d","%d/%m/%Y","%d-%m-%Y"))
[1] "2018-11-06 UTC" "2018-06-11 UTC" "2018-06-11 UTC"
you can try as.character(...)
to remove the time zone, if you do not need it to be in date format.
Another option is
fast_strptime(dates,c("%Y-%m-%d","%d/%m/%Y","%d-%m-%Y"))