Home > database >  Working with difficult AM/PM formats and REGEX with lubridate in R
Working with difficult AM/PM formats and REGEX with lubridate in R

Time:02-14

Hello guys I hope everyone is having a good one, I am trying to work with some AM/PM formats in lubridate on R but I cant seem to come up with a proper solution I hope you guys can correct meand help me out please!

I have a HUGE dataset that has date_time formats in a very rare way the format goes as follow:

First a number that represents the day, second an abbreviation of the month OR even the month fully spelled out a 12H time format and the strings " a. m." OR "p. m." or even a combination of more spaces between or missing "dots" then such as "a. m" to set an example please take a look at this vector:

dates<-c("02 dec 05:47 a. m", 
"7 November 09:47 p. m.",
"3  jul 12:28 a.m.", 
"23 sept 08:53 a m.", 
"7 may 09:05 PM")

These make up for more than 95% of the rare formats of datetime in the data set I have been trying to use lubridate on R I am trying to use the function

ydm_hm(paste(2021,dates))

this is because all dates are form 2021 but I get always:

[1] NA                        NA                        NA                       
[4] NA                        "2021-05-07 21:05:00 UTC"
Warning message:
 4 failed to parse. 

the 4 that fail to parse give me NAS and the only one that parses is correct I do notice that this one has PM or AM as uppercase letters without dots but most of the time my formats will be like this:

ydm_hm("7 may 09:05 p.m.")

and this gives me NAS...

So I feel as though the only way to get this dates to workout is to change the structure and using REGEX so convert all "a. m." combinations into "AM" and "PM" only after analyzing the data I realized all "p.m" or "a. m." strings come after ONE or TWO spaces after the 12H time format that always have a length of 5 characters and so what should be considered to come up with the patter of the REGEX is the following

the string will begins with one or two numbers then spaces and then letters (for the month abbreviated or fully spelled out after that will have spaces and then 5 characters (that's the 12H time format) and then will have letters spaces and dots for all possible a.m and p.m formats but I have tried with no luck to convert the structure of the date.. if you guys could help me I will be so freaking thankful I dont know if there is a way or another package in R that will even resolve this issue without using regex so thank you everyone for your help !

my desired output will be:

"2021-12-02 05;47:00 UTC"
"2021-11-07 09:47:00 UTC"
"2021-07-03 12:28:00 UTC"
"2021-09-23 08:53:00 UTC"
"2021-05-07 21:05:00 UTC"

CodePudding user response:

In this case, parse_date from parsedate works

library(parsedate)
parse_date(paste(2021, dates))

-output

[1] "2021-12-02 05:47:00 UTC"
[2] "2021-11-07 09:47:00 UTC" 
[3] "2021-07-03 12:28:00 UTC"
[4] "2021-09-23 08:53:00 UTC" 
[5] "2021-05-07 21:05:00 UTC"

Or if the second value should be PM, use str_remove to remove the space

library(stringr)
parse_date(paste(2021, str_remove_all(dates,
    "(?<=[A-Za-z])[. ] (?=[A-Za-z])")))
[1] "2021-12-02 05:47:00 UTC" 
[2] "2021-11-07 21:47:00 UTC" 
[3] "2021-07-03 00:28:00 UTC" 
[4] "2021-09-23 08:53:00 UTC" 
[5] "2021-05-07 21:05:00 UTC"

With ydm_hm, the issue is that one of the am/pm format showed spaces without the . and this may not get parsed. We could change the format by removing the spaces

library(lubridate)
library(stringr)
ydm_hm(paste(2021, str_remove_all(dates,
    "(?<=[A-Za-z])[. ] (?=[A-Za-z])")))
[1] "2021-12-02 05:47:00 UTC" 
[2] "2021-11-07 21:47:00 UTC" 
[3] "2021-07-03 00:28:00 UTC" 
[4] "2021-09-23 08:53:00 UTC" 
[5] "2021-05-07 21:05:00 UTC"

CodePudding user response:

Since you raised the issue of regular expression, I thought I might try one way to do that

library(stringr)

# get boolean for pm dates
pm = str_detect(dates,"(?<=\\d\\d:\\d\\d\\s{1,2})[pP]",)

# convert dates to dates without am/pm
dates = str_extract(dates,"^.*:\\d\\d")

# add pm back to pm dates and am to am dates
dates[pm] <- paste(dates[pm], "PM")
dates[!pm] <- paste(dates[!pm], "AM")

# now your orignal approach works
ydm_hm(paste(2021,dates))

Output

[1] "2021-12-02 05:47:00 UTC" "2021-11-07 21:47:00 UTC" "2021-07-03 00:28:00 UTC" "2021-09-23 08:53:00 UTC"
[5] "2021-05-07 21:05:00 UTC"
  • Related