I am using rvest to scrape an IMDB list and want to reach the full cast and crew page for each title. Unfortunately, the title links in the list point to IMDB's summary page for each title, which is not the page I need.
This is the webpage I get: https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt
This is the webpage I need: https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl
Notice the addition of /fullcredits in the URL.
How can I insert /fullcredits into the middle of a URL I have built?
#install.packages("rvest")
#install.packages("dplyr")
library(rvest) # web scraping
library(dplyr) # provides the %>% pipe
link <- "https://www.imdb.com/list/ls006266261/?st_dt=&mode=detail&page=1&sort=list_order,asc"
credits <- "fullcredits/"
page <- read_html(link)
# titles shown in the list
name <- page %>% html_nodes(".lister-item-header a") %>% html_text()
# relative hrefs of the titles, prefixed with the site root
movie_link <- page %>%
  html_nodes(".lister-item-header a") %>%
  html_attr("href") %>%
  paste0("https://www.imdb.com", .)
CodePudding user response:
Here is an option: get the dirname and basename from the link, replace the substring of the basename with the new substring ("tt_ql_cl"), and join them again with file.path after inserting "fullcredits" in between.
library(stringr)
movie_link2 <- file.path(dirname(movie_link), "fullcredits",
                         str_replace(basename(movie_link), "ttls_li_tt", "tt_ql_cl"))
Output:
> head(movie_link2)
[1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
> tail(movie_link2)
[1] "https://www.imdb.com/title/tt0144084/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0119654/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0477348/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0080339/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0469494/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl"
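To see what each step contributes, the same idea can be traced on a single URL. This is only a sketch: the names url, new_query and credits_url are made up for illustration, and base R's sub() with fixed = TRUE stands in for stringr::str_replace so that no extra package is needed.

```r
# One example URL, taken from the question
url <- "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt"

# dirname() keeps everything before the last "/", basename() keeps what follows it
dirname(url)   # "https://www.imdb.com/title/tt1375666"
basename(url)  # "?ref_=ttls_li_tt"

# Swap the ref tag in the query part (literal match, no regex)
new_query <- sub("ttls_li_tt", "tt_ql_cl", basename(url), fixed = TRUE)

# Rejoin the pieces with "fullcredits" in between; file.path() separates them with "/"
credits_url <- file.path(dirname(url), "fullcredits", new_query)
credits_url
```

Because dirname, basename, sub and file.path are all vectorised, the same lines should work unchanged on the whole movie_link vector.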