Inserting string into the middle of a URL in R

Time:11-29

I am using rvest to scrape an IMDB list and want to reach the full cast and crew page for each title. Unfortunately, the title links in the list point to each title's summary page, so following them takes me to the wrong page.

This is the webpage I get: https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt

This is the webpage I need: https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl

Notice the addition of the /fullcredits in the URL.

How can I insert /fullcredits into the middle of a URL I have built?

#install.packages("rvest")
#install.packages("dplyr")

library(rvest) # web scraping
library(dplyr) # provides the %>% pipe

link <- "https://www.imdb.com/list/ls006266261/?st_dt=&mode=detail&page=1&sort=list_order,asc"
credits <- "fullcredits/"
page <- read_html(link)

# Titles and absolute title-page URLs for every entry in the list
name <- page %>% html_nodes(".lister-item-header a") %>% html_text()
movie_link <- page %>% html_nodes(".lister-item-header a") %>% html_attr("href") %>% paste0("https://www.imdb.com", .)

User response:

Here is an option: split each link with dirname and basename, replace the substring "ttls_li_tt" in the basename with "tt_ql_cl", and rejoin the two pieces with file.path, inserting "fullcredits" in between.
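To see what each piece contributes, here is a small base R sketch on the single URL from the question. The variable names (url, dir_part, base_part, rebuilt) are only illustrative, and sub() with fixed = TRUE stands in for stringr's str_replace so that no extra package is needed.

```r
# One title URL, taken from the question
url <- "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt"

# Everything before the last "/" (the title path)
dir_part <- dirname(url)

# Everything after the last "/" (the query string)
base_part <- basename(url)

# Swap the referrer tag, then rejoin with "fullcredits" in the middle;
# file.path() puts a "/" between its arguments
rebuilt <- file.path(dir_part, "fullcredits",
                     sub("ttls_li_tt", "tt_ql_cl", base_part, fixed = TRUE))

print(rebuilt)
# "https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl"
```

The same three calls are vectorised, so they apply to the whole movie_link vector at once: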

library(stringr)
movie_link2 <- file.path(dirname(movie_link), "fullcredits", 
       str_replace(basename(movie_link), "ttls_li_tt", "tt_ql_cl"))

-output

> head(movie_link2)
[1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
> tail(movie_link2)
[1] "https://www.imdb.com/title/tt0144084/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0119654/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0477348/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0080339/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0469494/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl"
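As an alternative, the same rewrite can be sketched with a single regular expression in base R. This is not part of the answer above: insert_fullcredits is a hypothetical helper name, and the pattern assumes every link has the form /title/tt<digits>/?ref_=... as in the output shown.

```r
# Hypothetical helper: insert "fullcredits/" after the title id and
# replace the referrer tag with "tt_ql_cl".
insert_fullcredits <- function(urls) {
  sub("(/title/tt[0-9]+/)\\?ref_=[^&]*",   # capture ".../title/tt1234567/"
      "\\1fullcredits/?ref_=tt_ql_cl",     # keep the capture, append the rest
      urls)
}

# Example input in the same shape as movie_link
example_links <- c(
  "https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt",
  "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt"
)

print(insert_fullcredits(example_links))
```

Links that do not match the assumed pattern are returned unchanged, which makes it easy to spot entries that need separate handling.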