Use R Loop to Bulk Download Youtube Transcripts with youtubecaption

Time:03-05

I'm trying to use the youtubecaption library to download the transcripts for every video in a playlist and then combine the results into one data frame. I have a list of the video URLs and have tried to write a for loop that passes them to the get_caption() function, but only one video's transcript ends up in the data frame.

I've tried a few approaches:

vids <- as.list(mydata$videoId)

for (i in 1:length(vids)){
  vids2 <- paste("https://www.youtube.com/watch?v=",vids[i],sep="")
  test_transcript2 <-
    get_caption(
     url = vids2,
     language = "en",
     savexl = FALSE,
     openxl = FALSE,
     path = getwd())
  rbind(test_transcript, test_transcript2)
 }

I also tried applying the function to a column of the main data frame:

captions <- sapply(mydata[,24], FUN = get_captions)

Is there an efficient way to accomplish this?

CodePudding user response:

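In your code, you call rbind(test_transcript, test_transcript2) but never assign the result, so it is discarded on every iteration. Separately, growing a data frame with the rbind(old, newrow) pattern inside a loop scales poorly, because the whole frame is copied each time; it is better to collect the pieces in a list and bind them once at the end. With both changes, your code might be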

vids <- as.list(mydata$videoId)

out <- list()  # collects one data frame per video
for (i in 1:length(vids)){
  vids2 <- paste("https://www.youtube.com/watch?v=", vids[[i]], sep = "")
  test_transcript2 <-
    get_caption(
      url = vids2,
      language = "en",
      savexl = FALSE,
      openxl = FALSE,
      path = getwd())
  out <- c(out, list(test_transcript2))  # assign the result so it is kept
}
alldat <- do.call(rbind, out)  # bind all transcripts once, after the loop
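
The difference between the two loops can be reproduced without touching the network. The following is a minimal sketch, assuming a hypothetical stand-in, fake_get_caption(), that returns a small data frame per URL. It is not part of youtubecaption, and its column names and the video IDs are made up for illustration. Preallocating the list is a small variation on the c(out, list(...)) form above.

```r
# Hypothetical stand-in for get_caption(): returns a two-row data frame per URL.
# The column names are illustrative, not youtubecaption's actual output.
fake_get_caption <- function(url, ...) {
  data.frame(url = url, segment_id = 1:2, text = c("hello", "world"),
             stringsAsFactors = FALSE)
}

vids <- list("aaa", "bbb", "ccc")  # made-up video IDs
base_url <- "https://www.youtube.com/watch?v="

# Original pattern: the rbind() result is computed and then discarded.
lost <- fake_get_caption(paste0(base_url, "seed"))
for (i in seq_along(vids)) {
  new_rows <- fake_get_caption(paste0(base_url, vids[[i]]))
  rbind(lost, new_rows)  # not assigned, so `lost` never changes
}

# Corrected pattern: store each result in a list, bind once at the end.
out <- vector("list", length(vids))
for (i in seq_along(vids)) {
  out[[i]] <- fake_get_caption(paste0(base_url, vids[[i]]))
}
alldat <- do.call(rbind, out)

nrow(lost)    # still only the rows from the first call
nrow(alldat)  # rows from every video
```

With the real get_caption(), the same structure should apply as long as every call returns a data frame with the same columns, which is an assumption worth checking on a couple of videos first.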

Some other pointers:

  • for (i in 1:length(.)) is risky, especially once this is wrapped in a function: if vids is empty, 1:length(vids) is 1:0, so the loop still runs twice (for 1 and for 0). It's safer to use for (i in seq_along(vids))

  • since we never need the index number itself, we can use for (vid in vids)

  • we can do the pasting in one shot, which is generally faster in R, with for (vid in paste0("https://www.youtube.com/watch?v=", vids)), and then url = vid in the call to get_caption

  • with all that, it might be even simpler to use lapply for the whole thing:

    path <- getwd()
    out <- lapply(paste0("https://www.youtube.com/watch?v=", vids),
                  get_caption, language = "en", savexl = FALSE,
                  openxl = FALSE, path = path)
    alldat <- do.call(rbind, out)
    

(NB: untested.)
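
The pointers above can be checked offline. This is a minimal sketch, assuming a hypothetical stub, fake_get_caption(), in place of the real get_caption(), which needs network access. The stub's columns and the video IDs are invented for illustration.

```r
# Why seq_along() is safer: with an empty input, 1:length(x) counts down 1, 0.
empty <- character(0)
bad_idx  <- 1:length(empty)   # c(1, 0): a loop over this runs twice on nothing
good_idx <- seq_along(empty)  # integer(0): a loop over this does not run

# Hypothetical stand-in for get_caption(); not part of youtubecaption.
fake_get_caption <- function(url, language = "en", ...) {
  data.frame(url = url, language = language, text = "placeholder",
             stringsAsFactors = FALSE)
}

vids <- c("aaa", "bbb", "ccc")  # made-up video IDs

# paste0() is vectorized, so all URLs are built in a single call.
urls <- paste0("https://www.youtube.com/watch?v=", vids)

# lapply() returns one data frame per URL; extra arguments are passed through.
out <- lapply(urls, fake_get_caption, language = "en")
alldat <- do.call(rbind, out)
```

Swapping fake_get_caption for get_caption and adding the savexl, openxl and path arguments gives the lapply form shown above.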
