I have a dataset, yt_videos, that I scraped from YouTube:
id | Duration |
---|---|
01 | PT5M28S |
02 | PT10M |
03 | PT1H2M21S |
04 | PT1H54M |
05 | PT1H27S |
The Duration column represents the length of the video in hours, minutes and seconds. I am trying to convert it to an R time format and then to an integer number of seconds, like this:
id | Duration |
---|---|
01 | 328 |
02 | 600 |
03 | 3741 |
04 | 6840 |
05 | 3627 |
I have tried using the parse_ISO_8601_datetime() function, but I get this warning:
Warning message:
In parse_ISO_8601_datetime(yt_videos$duration) : Invalid entries:
I also tried the anytime() function, but it returns the wrong results:
[1] "1400-05-27 22:58:45 LMT" NA "1400-04-30 22:58:45 LMT" NA
[5] NA
What do I do?
CodePudding user response:
You can get duration in seconds like this:
dat <- c('PT5M28S', 'PT10M', 'PT1H2M21S', 'PT1H54M', 'PT1H27S')
# extract numeric parts for each unit
hms <- sapply(c('H', 'M', 'S'), function(unit)
  sub(paste0('.*[^0-9]+([0-9]+)', unit, '.*'), '\\1', dat))
# H M S
# [1,] "PT5M28S" "5" "28"
# [2,] "PT10M" "10" "PT10M"
# [3,] "1" "2" "21"
# [4,] "1" "54" "PT1H54M"
# [5,] "1" "PT1H27S" "27"
# change strings to numbers (non-numbers become NA)
suppressWarnings(mode(hms) <- 'numeric')
# H M S
# [1,] NA 5 28
# [2,] NA 10 NA
# [3,] 1 2 21
# [4,] 1 54 NA
# [5,] 1 NA 27
# multiply by seconds (3600, 60, 1) and sum
colSums(t(hms) * 60^(2:0), na.rm=T)
# [1] 328 600 3741 6840 3627
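In case the last line looks cryptic: `t(hms)` has one row per unit, and R recycles the length-3 vector `60^(2:0)` down each column, so the H row is multiplied by 3600, the M row by 60 and the S row by 1. Here is a minimal sketch with a made-up two-video matrix (the values mirror rows 1 and 3 above; the names `weights`, `weighted` and `secs` are just for illustration):

```r
# Hypothetical two-video example: rows are videos, columns are H, M, S
hms <- matrix(c(NA, 5, 28,
                 1, 2, 21),
              nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("H", "M", "S")))

weights  <- 60^(2:0)           # 3600, 60, 1 (seconds per hour, minute, second)
weighted <- t(hms) * weights   # 3 x 2 matrix: weights recycle down each column
secs     <- colSums(weighted, na.rm = TRUE)  # NA (missing unit) counts as 0
print(secs)
# [1]  328 3741
```

Transposing first matters: without `t()`, the weights would be recycled down the video dimension instead of across the units.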
CodePudding user response:
Someone may have a better answer, but here's my two cents.
library(stringr)  # str_extract()
library(dplyr)    # if_else()

get_seconds <- function(str = NA_character_){
  # first, strip the leading "PT" prefix (everything before the first digit)
  str <- substr(str, regexpr("(?=[0-9])", str, perl = TRUE)[1], nchar(str))
  # next, get the hours, minutes, and seconds and add them to a vector
  time_vect <- c(hours = as.integer(str_extract(str, ".*(?=H)")),
                 minutes = as.integer(str_extract(str, "[0-9]{0,2}(?=M)")),
                 seconds = as.integer(str_extract(str, "[0-9]{0,2}(?=S)")))
  # replace NA values with 0
  time_vect <- purrr::map_int(time_vect, ~if_else(is.na(.), 0L, .))
  # then, return the time in seconds
  return(as.integer(60*60*time_vect[1] + 60*time_vect[2] + time_vect[3]))
}
# sapply() (rather than lapply()) gives a plain vector instead of a list column
df$Duration <- sapply(df$Duration, get_seconds, USE.NAMES = FALSE)
EDIT: After seeing Robert Hacken's answer, here is a cleaner version of what I was suggesting:
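get_seconds <- function(str = NA_character_){
  # extract the number in front of each unit letter (NA if the unit is absent)
  time_vect <- purrr::map_int(c("H", "M", "S"),
                 ~as.integer(str_extract(str, paste0("[0-9]{0,2}(?=", ., ")"))))
  # weight by seconds per unit and sum, ignoring missing units
  return(sum(60*60*time_vect[1], 60*time_vect[2], time_vect[3], na.rm = TRUE))
}
df$Duration <- sapply(df$Duration, get_seconds, USE.NAMES = FALSE)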
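For comparison, the same H/M/S idea can be sketched in base R only, working on the whole column at once instead of one string at a time. The function name `iso8601_to_seconds` is made up for illustration, and the sketch assumes the durations contain only hour, minute and second parts (no days, no fractional seconds):

```r
# Hypothetical helper: convert "PT#H#M#S" strings to seconds (base R only).
# Assumption: only H, M and S parts occur; a missing part counts as 0.
iso8601_to_seconds <- function(x) {
  get_unit <- function(unit) {
    # regexec() captures the digits directly in front of the unit letter;
    # regmatches() yields character(0) for strings without that unit
    m <- regmatches(x, regexec(paste0("([0-9]+)", unit), x))
    vapply(m, function(p) if (length(p) == 2) as.numeric(p[2]) else 0,
           numeric(1))
  }
  get_unit("H") * 3600 + get_unit("M") * 60 + get_unit("S")
}

dat  <- c("PT5M28S", "PT10M", "PT1H2M21S", "PT1H54M", "PT1H27S")
secs <- iso8601_to_seconds(dat)
print(secs)
# [1]  328  600 3741 6840 3627
```

Applied to the data frame, this would be `df$Duration <- iso8601_to_seconds(df$Duration)`, which yields a numeric column directly.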