Convert heterogeneous date character vectors to date format (R)

Time:09-17

I'm trying to convert a character column with dates to date format. However, the column mixes several formats. Some entries are of the format %d.%m.%Y (e.g., "03.02.2021"), while others are %d %b %Y (e.g., "3 Feb 2021").

I've tried as.Date(x, tryFormats = c("%d %b %Y", "%d.%m.%Y")), but realized that tryFormats only picks a single format, based on the first entry, and then applies it to the whole vector. As a result, the entries of type %d %b %Y are parsed correctly while those of type %d.%m.%Y become NA, or vice versa. I've also tried the anytime package, but that produced NAs in a similar fashion.

I've made sure that the column doesn't contain any NAs or empty strings, and I don't receive any error message.
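The tryFormats behaviour described above can be reproduced with a two-element vector. This is only a minimal sketch in base R; the variable names x and mixed are made up for illustration.

```r
# Two entries, one per format found in the column
x <- c("03.02.2021", "3 Feb 2021")

# as.Date() tests the candidate formats against the first non-NA entry only,
# keeps the first one that parses it, and applies that format to every entry.
mixed <- as.Date(x, tryFormats = c("%d %b %Y", "%d.%m.%Y"))
mixed
# [1] "2021-02-03" NA
```

Here "%d.%m.%Y" is chosen because it is the first format that fits "03.02.2021", so "3 Feb 2021" silently becomes NA without any warning or error.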

CodePudding user response:

Try the parsedate package:

df <- read.table(header = TRUE, text =
"d
03.02.2021
'3 Feb 2021'
13/3/2021
13-3-2020")

library(dplyr)  # provides %>% and mutate()

df %>% mutate(date = parsedate::parse_date(d))
##           d       date
##1 03.02.2021 2021-02-03
##2 3 Feb 2021 2021-02-03
##3  13/3/2021 2021-03-13
##4  13-3-2020 2020-03-13

CodePudding user response:

Similar to Roland's suggestion (but expanded), my answer here (in section (2)) suggests a way to deal with multiple candidate formats.

## sample data
x <- c("03.02.2021", "3 Feb 2021")

formats <- c("%d.%m.%Y", "%d %b %Y")
dates <- as.Date(rep(NA, length(x)))
for (fmt in formats) {
  nas <- is.na(dates)
  dates[nas] <- as.Date(x[nas], format=fmt)
}
dates
# [1] "2021-02-03" "2021-02-03"

It is better to have the most-frequent format first in the formats vector. One could add a quick-escape to the loop if there are many formats, such as

for (fmt in formats) {
  nas <- is.na(dates)
  if (!any(nas)) break
  dates[nas] <- as.Date(x[nas], format=fmt)
}

but I suspect that it really won't be very beneficial unless both formats and x are rather large (I have no sizes in mind to quantify "large").
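The loop above can also be wrapped in a small function so that it is reusable on a data frame column. This is only a sketch: parse_mixed_dates is a hypothetical name, not a function from any package, and the warning for unmatched entries is an addition that is not part of the loop as posted.

```r
# Hypothetical helper: try each format in turn, filling in only the
# entries that are still unparsed after the previous formats.
parse_mixed_dates <- function(x, formats) {
  dates <- as.Date(rep(NA, length(x)))
  for (fmt in formats) {
    nas <- is.na(dates)
    if (!any(nas)) break  # everything parsed already, stop early
    dates[nas] <- as.Date(x[nas], format = fmt)
  }
  # Added check (an assumption about what is useful): report entries
  # that were not NA on input but matched none of the formats.
  failed <- is.na(dates) & !is.na(x)
  if (any(failed)) {
    warning(sum(failed), " value(s) matched none of the formats")
  }
  dates
}

x <- c("03.02.2021", "3 Feb 2021")
dates <- parse_mixed_dates(x, c("%d.%m.%Y", "%d %b %Y"))
dates
# [1] "2021-02-03" "2021-02-03"   (in an English locale)
```

Note that %b matches month names in the current locale, so "Feb" is only recognised when the session uses English month abbreviations.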

CodePudding user response:

Did you try lubridate?

df <- read.table(header = TRUE, text =
"d
03.02.2021
'3 Feb 2021'
13/3/2021
13-3-2020")

library(lubridate)  # provides dmy()

dmy(df$d)

[1] "2021-02-03" "2021-02-03" "2021-03-13" "2020-03-13"

CodePudding user response:

Using anydate from the anytime package:

library(anytime)
addFormats(c("%d/%m/%Y", "%d-%m-%Y"))
anydate(df$d)
[1] "2021-02-03" "2021-02-03" "2021-03-13" "2020-03-13"