R datetime series missing values

Time:12-10

I am converting a column from chr to POSIXct using the format "%Y-%m-%d %H:%M:%S". However, some of the times in the dataset do not have the seconds portion (%S), so when I convert to DateTime those values are returned as NA. How do I make sure this does not happen? I want them all returned as DateTime, regardless of whether the seconds portion is missing.

The date-times that are in this format ("%Y-%m-%d %H:%M:%S") are returned correctly as POSIXct.

But the date-times that are in this format ("%Y-%m-%d %H:%M") are returned as NA.

This is the code used for the conversion:

trips$ended_at <- as.POSIXct(trips$ended_at, format = "%Y-%m-%d %H:%M:%S")

It is a huge dataset with over a million entries, so I don't even know which datetimes lack the seconds portion. Is there a way to pad those with zeroes at the end? For example, 2020-04-29 01:57 would be returned as 2020-04-29 01:57:00 when converted to POSIXct.

Please help!

CodePudding user response:

Here's what I usually do. Check the length of the character strings prior to POSIXct formatting via nchar(), store the results in a new column, for example:

trips$check<-nchar(trips$ended_at)

Then check whether every value in trips$ended_at has the full length, and append the missing seconds to those that don't:

trips$ended_at_new<-ifelse(trips$check==19,trips$ended_at,paste(trips$ended_at,":00",sep=""))

Swap the 19 for the full character count of whatever datetime format you are using. Important note: this only works if the seconds are what is missing from the end of the timestamp, not if the string is shorter than 19 characters for any other reason. The new column is still character, so pass it to as.POSIXct() afterwards.
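As a rough end-to-end sketch of the length-check idea, the snippet below first tabulates the string lengths (useful when it is unclear which rows lack seconds), pads only the 16-character values, and then converts. The sample data frame and the choice of tz = "UTC" are assumptions for illustration, not part of the original question.

```r
# Hypothetical sample data; the real column would be trips$ended_at.
trips <- data.frame(
  ended_at = c("2020-04-29 01:57:00", "2020-04-29 01:57", NA),
  stringsAsFactors = FALSE
)

# Tabulate string lengths to see which shapes occur
# (19 = with seconds, 16 = without seconds).
print(table(nchar(trips$ended_at), useNA = "ifany"))

# Append ":00" only where the string has exactly 16 characters.
short <- !is.na(trips$ended_at) & nchar(trips$ended_at) == 16
trips$ended_at[short] <- paste0(trips$ended_at[short], ":00")

# tz = "UTC" is an assumption; use the time zone the data was recorded in.
trips$ended_at <- as.POSIXct(trips$ended_at,
                             format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

Any length other than 16 or 19 in the table is worth inspecting by hand before converting, since padding would not repair those values.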

CodePudding user response:

This is the approach I took, using ifelse() under the assumption that you are dealing with only two possibilities: timestamps with seconds and timestamps without.

date_time <- c("2020-01-18 20:12:16", "2020-01-18 20:12")

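# ifelse() returns character strings here, so convert back to POSIXct at the end.
fixed <- ifelse(nchar(date_time) == 16,
       format(as.POSIXct(date_time, format="%Y-%m-%d %H:%M"), "%Y-%m-%d %H:%M:%S"),
       format(as.POSIXct(date_time, format="%Y-%m-%d %H:%M:%S"), "%Y-%m-%d %H:%M:%S"))
as.POSIXct(fixed, format = "%Y-%m-%d %H:%M:%S")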
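One possible reason for routing the values through format() is that ifelse() drops attributes, so feeding it POSIXct vectors directly gives plain numbers. The sketch below illustrates that behaviour and shows logical indexing as an alternative that keeps the class. The variable names and tz = "UTC" are assumptions chosen for the example.

```r
# Sample values from the answer above; tz = "UTC" is an assumption
# made so the example behaves the same on every machine.
date_time <- c("2020-01-18 20:12:16", "2020-01-18 20:12")
has_seconds <- nchar(date_time) == 19

with_sec <- as.POSIXct(date_time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
without_sec <- as.POSIXct(date_time, format = "%Y-%m-%d %H:%M", tz = "UTC")

# ifelse() strips the POSIXct class, leaving seconds since 1970-01-01.
raw <- ifelse(has_seconds, with_sec, without_sec)
print(class(raw))

# Logical indexing keeps the POSIXct class instead.
parsed <- without_sec
parsed[has_seconds] <- with_sec[has_seconds]
print(parsed)
```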

CodePudding user response:

Two methods:

  1. Concatenate the literal :00 onto the end of timestamps that only have hour/minute:

    as.POSIXct(trips$ended_at, format = "%Y-%m-%d %H:%M:%S")
    # [1] "2020-04-29 01:57:00 EDT" "2020-04-29 01:57:00 EDT"
    # [3] "2020-04-29 01:57:00 EDT" NA                       
    # [5] "2020-04-29 01:57:00 EDT"
    
    gsub("( [0-9]+:[0-9]+)$", "\\1:00", trips$ended_at)
    # [1] "2020-04-29 01:57:00" "2020-04-29 01:57:00" "2020-04-29 01:57:00"
    # [4] "2020-04-29 01:57:00" "2020-04-29 01:57:00"
    
    as.POSIXct(gsub("( [0-9]+:[0-9]+)$", "\\1:00", trips$ended_at), format = "%Y-%m-%d %H:%M:%S")
    # [1] "2020-04-29 01:57:00 EDT" "2020-04-29 01:57:00 EDT"
    # [3] "2020-04-29 01:57:00 EDT" "2020-04-29 01:57:00 EDT"
    # [5] "2020-04-29 01:57:00 EDT"
    
  2. If you have multiple "candidate" formats to try, you can step through them in turn. Place the most likely candidates first; as soon as every timestamp has been converted, the for loop exits early.

    candidates <- c("%Y-%m-%d %H:%M", "%d/%m/%Y %H:%M:%S", "%d/%m/%Y %H:%M")
    out <- as.POSIXct(trips$ended_at, format = "%Y-%m-%d %H:%M:%S")
    for (fmt in candidates) {
      if (!any(isna <- is.na(out))) break
      out[isna] <- as.POSIXct(trips$ended_at[isna], format = fmt)
    }
    out
    # [1] "2020-04-29 01:57:00 EDT" "2020-04-29 01:57:00 EDT"
    # [3] "2020-04-29 01:57:00 EDT" "2020-04-29 01:57:00 EDT"
    # [5] "2020-04-29 01:57:00 EDT"
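The candidate-format loop can also be wrapped in a small reusable function. This is only a sketch: parse_multi_format() is a hypothetical helper name, not a base R function, and tz = "UTC" is an assumed default. Because strptime() tends to ignore trailing characters, the fuller formats are listed before the shorter ones so that seconds are not silently dropped.

```r
# parse_multi_format() is a hypothetical helper, not part of base R.
# It tries each format in order and fills in only the values that are still NA.
parse_multi_format <- function(x, formats, tz = "UTC") {
  out <- as.POSIXct(x, format = formats[1], tz = tz)
  for (fmt in formats[-1]) {
    # Only retry values that failed so far and were not missing to begin with.
    isna <- is.na(out) & !is.na(x)
    if (!any(isna)) break
    out[isna] <- as.POSIXct(x[isna], format = fmt, tz = tz)
  }
  out
}

# Usage with made-up sample strings covering three shapes.
x <- c("2020-04-29 01:57:00", "2020-04-29 01:57", "29/04/2020 01:57:30")
res <- parse_multi_format(
  x,
  formats = c("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%d/%m/%Y %H:%M:%S")
)
print(res)
```

Values that match none of the formats stay NA, so sum(is.na(res)) after the call gives a quick count of timestamps that still need attention.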
    

Data

trips <- data.frame(ended_at = c("2020-04-29 01:57:00", "2020-04-29 01:57:00", "2020-04-29 01:57:00", "2020-04-29 01:57", "2020-04-29 01:57:00"))