Convert ISO 8601 string to time format in R

Time:09-18

I have a dataset, yt_videos, that I scraped from YouTube:

id Duration
01 PT5M28S
02 PT10M
03 PT1H2M21S
04 PT1H54M
05 PT1H27S

The Duration column holds the length of each video in hours, minutes and seconds, encoded as an ISO 8601 duration. I am trying to convert it to an R time format and then to an integer number of seconds, like this:

id Duration
01 328
02 600
03 3741
04 6840
05 3627

I have tried using the parse_ISO_8601_datetime() function, but I get this warning:

Warning message:
In parse_ISO_8601_datetime(yt_videos$duration) : Invalid entries:

I also tried the anytime() function but it returns wrong results:

  [1] "1400-05-27 22:58:45 LMT" NA                        "1400-04-30 22:58:45 LMT" NA                       
  [5] NA                                               

What do I do?

CodePudding user response:

You can get duration in seconds like this:

dat <- c('PT5M28S', 'PT10M', 'PT1H2M21S', 'PT1H54M', 'PT1H27S')

# extract numeric parts for each unit
hms <- sapply(c('H', 'M', 'S'), function(unit) 
  sub(paste0('.*[^0-9]+([0-9]+)', unit, '.*'), '\\1', dat))
#      H         M         S        
# [1,] "PT5M28S" "5"       "28"     
# [2,] "PT10M"   "10"      "PT10M"  
# [3,] "1"       "2"       "21"     
# [4,] "1"       "54"      "PT1H54M"
# [5,] "1"       "PT1H27S" "27"     

# change strings to numbers (non-numbers become NA)
suppressWarnings(mode(hms) <- 'numeric')
#       H  M  S
# [1,] NA  5 28
# [2,] NA 10 NA
# [3,]  1  2 21
# [4,]  1 54 NA
# [5,]  1 NA 27

# multiply by seconds (3600, 60, 1) and sum
colSums(t(hms) * 60^(2:0), na.rm = TRUE)
# [1]  328  600 3741 6840 3627
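
The last step works because t(hms) has one row per unit (H, M, S), so multiplying by 60^(2:0), that is c(3600, 60, 1), scales each row by its number of seconds before colSums() adds them up per video. As a rough sketch of the same arithmetic written as a matrix product, with the numeric matrix typed in by hand from the output above (the name secs is only illustrative):

```r
# Numeric matrix copied by hand from the step above
# (rows = videos, columns = H, M, S)
hms <- matrix(c(NA,  5, 28,
                NA, 10, NA,
                 1,  2, 21,
                 1, 54, NA,
                 1, NA, 27),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("H", "M", "S")))

# A missing unit contributes zero seconds
hms[is.na(hms)] <- 0

# Weight each column by its length in seconds and sum across each row
secs <- drop(hms %*% c(3600, 60, 1))
secs
# [1]  328  600 3741 6840 3627
```

Replacing NA with 0 first plays the same role as na.rm = TRUE in the colSums() call.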

CodePudding user response:

Someone may have a better answer, but here's my two cents.

get_seconds <- function(str = NA_character_){
  # first, strip the leading "PT" prefix (everything before the first digit)
  str <- substr(str, regexpr("(?=[0-9])", str, perl = TRUE)[1], nchar(str))
  # next, get the hours, minutes, and seconds and add them to a vector
  # (str_extract() comes from the stringr package)
  time_vect <- c(hours = as.integer(stringr::str_extract(str, ".*(?=H)")),
                 minutes = as.integer(stringr::str_extract(str, "[0-9]{0,2}(?=M)")),
                 seconds = as.integer(stringr::str_extract(str, "[0-9]{0,2}(?=S)")))
  # replace NA values with 0
  time_vect[is.na(time_vect)] <- 0L
  
  # then, return the time in seconds
  return(as.integer(60*60*time_vect[1] + 60*time_vect[2] + time_vect[3]))
}

# vapply() returns a plain integer vector; lapply() would create a list column
df$Duration <- vapply(df$Duration, get_seconds, integer(1), USE.NAMES = FALSE)

EDIT: after seeing Robert Hacken's answer, here is a cleaner version of what I was suggesting:

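get_seconds <- function(str = NA_character_){
  # extract the number in front of each unit letter (NA if the unit is absent)
  time_vect <- purrr::map_int(c("H", "M", "S"),
                              ~as.integer(stringr::str_extract(str, paste0("[0-9]{0,2}(?=",.,")"))))
  return(sum(60*60*time_vect[1], 60*time_vect[2], time_vect[3], na.rm = TRUE))
}

# sum() returns a double here, so ask vapply() for numeric(1)
df$Duration <- vapply(df$Duration, get_seconds, numeric(1), USE.NAMES = FALSE)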
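
For anyone who would rather avoid extra packages, here is a minimal base R sketch of the same idea, applied to the data from the question. The helper name iso8601_to_seconds is made up for illustration and does not come from any package. It assumes the strings only contain whole-number H, M and S parts after the "PT" prefix, so it will not handle a date part such as "P1DT2H" or fractional seconds.

```r
# Data frame rebuilt from the table in the question
yt_videos <- data.frame(
  id = c("01", "02", "03", "04", "05"),
  Duration = c("PT5M28S", "PT10M", "PT1H2M21S", "PT1H54M", "PT1H27S"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: vectorised over x, base R only.
# Assumes only whole-number H/M/S components are present.
iso8601_to_seconds <- function(x) {
  # Integer in front of a unit letter, or 0 when that unit is missing
  get_unit <- function(unit) {
    m <- regmatches(x, regexec(paste0("([0-9]+)", unit), x))
    vapply(m, function(p) if (length(p) == 2) as.integer(p[2]) else 0L,
           integer(1))
  }
  3600L * get_unit("H") + 60L * get_unit("M") + get_unit("S")
}

yt_videos$Duration <- iso8601_to_seconds(yt_videos$Duration)
# Duration is now the integer vector 328, 600, 3741, 6840, 3627
```

Because the helper takes the whole column at once, no lapply() or sapply() call is needed around it.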