I have to combine several spreadsheets for my project and for some reason they all decided to enter the dates differently. Below is a sample of my data with all of the different date formats I got after reading in and combining all of the spreadsheets.
date
13-MAR-18
2018-08-05
43423
11-Mar-2019
10/16/2018
I'm trying to standardize everything and turn it into a yyyy-mm-dd format. This is my attempt below.
library(lubridate)
parse_date_time(x = df$date,
orders = c("Y m d, "d m y", "d B Y", "m/d/y", ),
locale = "eng")
The problem is that the moment I reach a row with a different format, it stops working and just gives me NA's for the rest of the rows. How do I fix this?
CodePudding user response:
In my point of view there are 3 challenges.
First of all it is difficult to feed the orders
argument of parse_date_time
. I think after applying arrange we could get some control.
Second, the integer type of date must be handled separately I think.
Finally, it is not clear which is month in "2018-08-05" it could be 8 or 5.
library(lubridate)
library(dplyr)
df %>%
arrange(date) %>%
mutate(x = parse_date_time(date, orders=c("mdy", "dmy", "ymd")),
y = as.integer(str_extract(date, '^\\d $')),
y = as.Date(y, origin = "1899-12-30"),
x = coalesce(x,y)) %>%
select(date, new_date=x)
date new_date
1 10/16/2018 2018-10-16
2 11-Mar-2019 2019-11-20
3 13-MAR-18 2018-03-13
4 2018-08-05 2018-08-05
5 43423 2018-11-19
Warning message:
Problem while computing `x = parse_date_time(date, orders =
c("mdy", "dmy", "ymd"))`.
i 1 failed to parse.
CodePudding user response:
Edit:
I've just realised that the conversion for "43423" to "2022-07-27 16:02:42" is incorrect using this approach (2022-07-27 is today's date), as further explained in the docs: https://github.com/gaborcsardi/parsedate
For this solution to work, you would need to handle 'integer' date formats first, then parsedate()
the remaining dates. I think @TarJae's solution is the way to go.
Original answer:
Here is a potential solution using the parsedate package:
#install.packages("parsedate")
library(parsedate)
df <- read.table(text = "date
13-MAR-18
2018-08-05
43423
11-Mar-2019
10/16/2018", header = TRUE)
df$date_parsed <- parse_date(df$date)
df
#> date date_parsed
#> 1 13-MAR-18 2018-03-13 00:00:00
#> 2 2018-08-05 2018-08-05 00:00:00
#> 3 43423 2022-07-27 16:02:42
#> 4 11-Mar-2019 2019-03-11 00:00:00
#> 5 10/16/2018 2018-10-16 00:00:00
Created on 2022-07-27 by the reprex package (v2.0.1)
CodePudding user response:
You need
as.Date(x, tryFormats = c()
See:
Use as.Date with tryFormats to parse dates with different formats
You need to specify the formats in c() in the best order to run it.
CodePudding user response:
Here is an answer without using any package for reference.
dateV=c(
'13-MAR-18',
'2018-08-05',
'13-Jun-19')
#Note: "pat and formatUsed" is a lookup in pair
pat=c("[0-9]{2}-\\w\\w\\w-[0-9]{2}$", "[0-9]{4}-[0-9]{2}-[0-9]{2}")
formatUsed=c("%d-%B-%y", "%Y-%m-%d")
betterDateV=NULL
for (tryP in 1:length(pat))
{
dx=which(grepl(pat[tryP], dateV))
betterDateV[dx] = as.Date( dateV[dx] , format= formatUsed[tryP] )
}
as.Date(as.numeric(betterDateV))