Is there an R function to parse vector containing dates in different formats?

Time:10-05

I am working with a character vector of 468006 elements; each element represents a date/time in one of two formats.

A snippet of the vector is as follows:

> result_date_time_vector[1810:1820]

 [1] "2021-01-03 02:22:27" "2021-01-03 02:22:27" "2021-01-03 02:22:27" "2021-01-03 02:22:27" "2021-01-03 02:22:27" "2021-01-03 02:22:27"
 [7] "1/3/2021"            "2021-01-03 13:12:57" "2021-01-03 13:12:57" "2021-01-03 13:12:57" "2021-01-03 13:12:57"

> class(result_date_time_vector)
[1] "character"

I would like to remove the information about time and then convert the elements to a single consistent format.

I tried a for loop. It ran without errors or warnings, but it was very slow:

> fixed_result_date_time <- rep (NA, length(result_date_time_vector))
> class(fixed_result_date_time) <- "Date"
> for (n in 1:length(result_date_time_vector)){
  if (is.na(result_date_time_vector[n])){
    next
  } else if (str_detect(result_date_time_vector[n], "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}")){
    fixed_result_date_time[n] <- as_date(ymd_hms(result_date_time_vector[n], tz = "America/New_York"))
  } else {
    fixed_result_date_time[n] <- as_date(mdy(result_date_time_vector[n], tz = "America/New_York"))
  }
}

I also tried the ifelse function. It was quick, but it produced a lot of warnings:

> fixed_result_date_time <- ifelse(str_detect(result_date_time_vector, "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"),
                    as_date(ymd_hms(result_date_time_vector, tz = "America/New_York")),
                    as_date(mdy(result_date_time_vector, tz = "America/New_York")))

Warning:  20963 failed to parse.
Warning:  447043 failed to parse.

> class(fixed_result_date_time) <- "Date"

There were 20963 elements in the m/d/y format and 447043 elements in the y-m-d h:m:s format in the input vector.

Is there a more efficient method to accomplish the same without warnings?

CodePudding user response:

The lubridate package has the function parse_date_time, which accepts several candidate formats through its orders argument:

orders <- c("ymd HMS", "mdy")
lubridate::parse_date_time(x, orders = orders)
# [1] "2021-01-03 02:22:27 UTC" "2021-01-03 02:22:27 UTC"
# [3] "2021-01-03 02:22:27 UTC" "2021-01-03 02:22:27 UTC"
# [5] "2021-01-03 02:22:27 UTC" "2021-01-03 02:22:27 UTC"
# [7] "2021-01-03 00:00:00 UTC" "2021-01-03 13:12:57 UTC"
# [9] "2021-01-03 13:12:57 UTC" "2021-01-03 13:12:57 UTC"
#[11] "2021-01-03 13:12:57 UTC"

Data

x<-'"2021-01-03 02:22:27" "2021-01-03 02:22:27" "2021-01-03 02:22:27" "2021-01-03 02:22:27" "2021-01-03 02:22:27" "2021-01-03 02:22:27"
"1/3/2021"            "2021-01-03 13:12:57" "2021-01-03 13:12:57" "2021-01-03 13:12:57" "2021-01-03 13:12:57"'
x <- scan(text=x, what = character())
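
As a side note on the warnings in the question: ifelse evaluates both of its branches on the whole vector, so ymd_hms is also run on the m/d/y strings and mdy on the y-m-d h:m:s strings. That would explain why each warning count equals the number of elements in the other format. A minimal sketch, assuming the two formats from the question are the only ones present, parses each subset with its own parser through logical indexing (the names is_ymd_hms, is_mdy and out are made up for illustration):

```r
library(lubridate)

x <- c("2021-01-03 02:22:27", "1/3/2021", NA, "2021-01-03 13:12:57")

# TRUE where the string looks like "yyyy-mm-dd hh:mm:ss"; grepl() gives FALSE for NA
is_ymd_hms <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}$", x)
# Assumption: every other non-missing element is in m/d/y format
is_mdy <- !is_ymd_hms & !is.na(x)

# Start from an all-NA Date vector and fill each subset with its own parser
out <- rep(as.Date(NA), length(x))
out[is_ymd_hms] <- as_date(ymd_hms(x[is_ymd_hms], tz = "America/New_York"))
out[is_mdy] <- mdy(x[is_mdy])

out
```

Because each parser only sees strings in its own format, neither should emit "failed to parse" warnings, and missing values stay NA.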

CodePudding user response:

Use the anytime function from the anytime package:

anytime::anytime(dates)
 [1] "2021-01-03 02:22:27 PST" "2021-01-03 02:22:27 PST" "2021-01-03 02:22:27 PST"
 [4] "2021-01-03 02:22:27 PST" "2021-01-03 02:22:27 PST" "2021-01-03 02:22:27 PST"
 [7] "2021-01-03 00:00:00 PST" "2021-01-03 13:12:57 PST" "2021-01-03 13:12:57 PST"
[10] "2021-01-03 13:12:57 PST" "2021-01-03 13:12:57 PST"

DATA:

dates <- c("2021-01-03 02:22:27", "2021-01-03 02:22:27", "2021-01-03 02:22:27", 
"2021-01-03 02:22:27", "2021-01-03 02:22:27", "2021-01-03 02:22:27", 
"1/3/2021 ", "2021-01-03 13:12:57", "2021-01-03 13:12:57", "2021-01-03 13:12:57", 
"2021-01-03 13:12:57")

CodePudding user response:

You could also try the parsedate package:

> result_date_time_vector <- c("2021-01-03 02:22:27", "2021-01-03 02:22:27", "2021-01-03 02:22:27", "2021-01-03 02:22:27", "2021-01-03 02:22:27", "2021-01-03 02:22:27", "1/3/2021", "2021-01-03 13:12:57", "2021-01-03 13:12:57", "2021-01-03 13:12:57", "2021-01-03 13:12:57")
> parsedate::parse_date(result_date_time_vector)
 [1] "2021-01-03 02:22:27 UTC" "2021-01-03 02:22:27 UTC" "2021-01-03 02:22:27 UTC"
 [4] "2021-01-03 02:22:27 UTC" "2021-01-03 02:22:27 UTC" "2021-01-03 02:22:27 UTC"
 [7] "2021-01-03 00:00:00 UTC" "2021-01-03 13:12:57 UTC" "2021-01-03 13:12:57 UTC"
[10] "2021-01-03 13:12:57 UTC" "2021-01-03 13:12:57 UTC"

CodePudding user response:

conv_dates <- function(dates, fmts = c("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y", "%m/%d/%Y"), origin = "1900-01-01") {
  out <- Sys.Date()[rep(NA, length(dates))]
  notna0 <- !is.na(dates)
  allnum <- notna0 & grepl("^[.0-9]+$", dates)
  if (any(allnum)) out[allnum] <- suppressWarnings(as.Date(as.numeric(dates[allnum]), origin = origin))
  for (fmt in fmts) {
    isna <- notna0 & is.na(out)
    if (!any(isna)) break
    out[isna] <- as.Date(dates[isna], format = fmt)
  }
  out
}

result_date_time_vector <- c("2021-01-03 02:22:27", "2021-01-03 02:22:27", "2021-01-03 02:22:27", "2021-01-03 02:22:27", "2021-01-03 02:22:27", "2021-01-03 02:22:27", "1/3/2021", "2021-01-03 13:12:57", "2021-01-03 13:12:57", "2021-01-03 13:12:57", "2021-01-03 13:12:57")

conv_dates(gsub(" .*", "", result_date_time_vector))
#  [1] "2021-01-03" "2021-01-03" "2021-01-03" "2021-01-03" "2021-01-03" "2021-01-03" "0001-03-20" "2021-01-03"
#  [9] "2021-01-03" "2021-01-03" "2021-01-03"
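
Note that element 7 comes out as "0001-03-20" rather than "2021-01-03": "1/3/2021" fails "%Y-%m-%d" but is accepted by the next format, "%Y/%m/%d", as year 1, month 3, day 20, with the trailing "21" ignored. A hedged base R sketch of the same try-the-next-format idea, assuming only the two formats from the question occur, uses "%m/%d/%Y" as the fallback instead (x, d and todo are illustrative names):

```r
x <- c("2021-01-03 02:22:27", "1/3/2021", NA, "2021-01-03 13:12:57")

# as.Date() with an explicit format ignores trailing text, so the time part is dropped
d <- as.Date(x, format = "%Y-%m-%d")

# Elements that are present but did not match the first format
todo <- is.na(d) & !is.na(x)
d[todo] <- as.Date(x[todo], format = "%m/%d/%Y")

d
```

With more than two formats, the order in which they are tried matters for the same reason, so the most specific formats should come first.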

CodePudding user response:

library(lubridate)

date(parse_date_time(vector, orders = c('ymd HMS', 'mdy')))
 [1] "2021-01-03" "2021-01-03" "2021-01-03" "2021-01-03" "2021-01-03" "2021-01-03"
 [7] "2021-01-03" "2021-01-03" "2021-01-03" "2021-01-03" "2021-01-03"