I'm trying to use the youtubecaption library to download all the transcripts for a playlist and then build a single data frame from the results. I have a list of the video URLs and have tried a for loop that passes them into the `get_caption()` function, but I only ever get one video's transcript into the data frame.
I've tried a few approaches:
```r
vids <- as.list(mydata$videoId)
for (i in 1:length(vids)) {
  vids2 <- paste("https://www.youtube.com/watch?v=", vids[i], sep = "")
  test_transcript2 <-
    get_caption(
      url = vids2,
      language = "en",
      savexl = FALSE,
      openxl = FALSE,
      path = getwd())
  rbind(test_transcript, test_transcript2)
}
```
I also tried applying it over the URL column of the main data frame: `captions <- sapply(mydata[, 24], FUN = get_caption)`.
Is there an efficient way to accomplish this?
CodePudding user response:
In your code, you do `rbind(test_transcript, test_transcript2)` but never assign the result, so it is lost forever: each pass through the loop builds the combined frame and immediately discards it. When we combine that with my comment about not using the `rbind(old, newrow)` paradigm (it re-copies the ever-growing frame on every iteration), your code might be:
```r
vids <- as.list(mydata$videoId)
out <- list()
for (i in 1:length(vids)) {
  vids2 <- paste("https://www.youtube.com/watch?v=", vids[i], sep = "")
  test_transcript2 <-
    get_caption(
      url = vids2,
      language = "en",
      savexl = FALSE,
      openxl = FALSE,
      path = getwd())
  out <- c(out, list(test_transcript2))  # accumulate results in a list
}
alldat <- do.call(rbind, out)            # combine once, at the end
```
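Side note, not something you need if every transcript has the same columns: `do.call(rbind, out)` errors when the per-video frames differ in their columns. If that ever bites, `dplyr::bind_rows` is a more forgiving drop-in, assuming you're fine taking the dependency:

```r
library(dplyr)
# Drop-in replacement for do.call(rbind, out): stacks the list of
# data frames and fills any mismatched columns with NA.
alldat <- bind_rows(out)
```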
Some other pointers:
- `for (i in 1:length(.))` can be a bad practice if this is functionalized; it's better to use `for (i in seq_along(vids))`. (On an empty `vids`, `1:length(vids)` is `c(1, 0)`, so the loop body still runs twice, while `seq_along(vids)` is `integer(0)` and the loop never runs.)
- We never need the index number itself here, so we can use `for (vid in vids)`.
- We can do the `paste`-ing in one shot, generally faster for R, with `for (vid in paste0("https://www.youtube.com/watch?v=", vids))`, and then `url = vid` in the call to `get_caption`.
- With all that, it might be even simpler to use `lapply` for the whole thing:

```r
path <- getwd()
out <- lapply(paste0("https://www.youtube.com/watch?v=", vids),
              get_caption,
              language = "en", savexl = FALSE, openxl = FALSE, path = path)
do.call(rbind, out)
```
(NB: untested.)
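One more caveat from me, also untested and resting on an assumption about the package rather than anything documented: if a video in the playlist has no English caption track, I'd expect `get_caption` to throw an error and abort the whole run. A minimal sketch that skips such failures:

```r
path <- getwd()
urls <- paste0("https://www.youtube.com/watch?v=", vids)

# Wrap each call so one bad video yields NULL instead of killing
# the whole run (assumes failures surface as R errors).
out <- lapply(urls, function(u) {
  tryCatch(
    get_caption(url = u, language = "en",
                savexl = FALSE, openxl = FALSE, path = path),
    error = function(e) NULL
  )
})

out <- Filter(Negate(is.null), out)  # drop the videos that failed
do.call(rbind, out)
```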