I'm working on scraping the Lord of the Rings movie scripts from the website in the URLs below. Each script is split across multiple pages.
I can get the info I need for a single page with this code:
library(dplyr)
library(rvest)
url_success <- "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php"
success <- read_html(url_success) %>%
  html_elements("#AutoNumber1") %>%
  html_table()
summary(success)
Length Class Mode
[1,] 2 tbl_df list
This works for all Fellowship of the Ring pages and all Return of the King pages. It also works for the Two Towers pages covering scenes 57 to 66. However, every other Two Towers page (scenes 1 to 56) returns an empty list instead:
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
  html_elements("#AutoNumber1") %>%
  html_table()
summary(fail)
Length Class Mode
0 list list
I've inspected the pages in Chrome, and the failing pages appear to have the same structure as the succeeding ones, including the 'AutoNumber1' table. Can anyone help with this?
CodePudding user response:
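It works if you select the table with XPath instead of a CSS selector. The HTML may be ill-formed (the page doesn't look very spec-compliant), which could explain why the CSS selector finds nothing on some pages.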
library(rvest)
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
  html_elements(xpath = '//*[@id="AutoNumber1"]') %>%
  html_table()
fail
#> [[1]]
#> # A tibble: 139 × 2
#> X1 X2
#> <chr> <chr>
#> 1 "Scene 1 ~ The Foundations of Stone\r\n\r\n\r\nThe movie opens as the … "Sce…
#> 2 "GANDALF VOICE OVER:" "You…
#> 3 "FRODO VOICE OVER:" "Gan…
#> 4 "GANDALF VOICE OVER:" "I a…
#> 5 "The scene changes to \r\n inside Moria. Gandalf is on the Bridge … "The…
#> 6 "GANDALF:" "You…
#> 7 "Gandalf slams down his staff onto the Bridge, \r\ncausing it to crack… "Gan…
#> 8 "BOROMIR :" "(ho…
#> 9 "FRODO:" "Gan…
#> 10 "GANDALF:" "Fly…
#> # … with 129 more rows
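If you want one function for every page, a rough sketch is to try the CSS selector first and fall back to the XPath query when it matches nothing. The helper name `get_autonumber_table` and the inline demo document are made up for illustration and are not part of the site or of rvest. The demo only exercises the CSS path, so the fallback itself is untested here.

```r
library(rvest)

# Hypothetical helper (not from the original post): try the CSS selector
# first, and fall back to the equivalent XPath query if nothing matches.
get_autonumber_table <- function(page, id = "AutoNumber1") {
  nodes <- html_elements(page, css = paste0("#", id))
  if (length(nodes) == 0) {
    nodes <- html_elements(page, xpath = sprintf('//*[@id="%s"]', id))
  }
  # Returns a list with one tibble per matched table
  html_table(nodes)
}

# Offline demonstration on a tiny inline document (made-up content)
demo_page <- minimal_html('
  <table id="AutoNumber1">
    <tr><td>GANDALF:</td><td>You shall not pass!</td></tr>
    <tr><td>FRODO:</td><td>Gandalf!</td></tr>
  </table>')
demo_tables <- get_autonumber_table(demo_page)

# Possible use with the real pages (needs network access, not run here):
# urls <- c(url_success, url_fail)
# tables <- lapply(urls, function(u) get_autonumber_table(read_html(u)))
```

If the fallback ends up firing on the scenes 1 to 56 pages, that would be consistent with the CSS selector being the part that breaks, though it would not show why.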