I am trying to scrape the text box at the bottom of this page, under the "text" tab. I have spent a long time trying to figure out how to do so, but have had no luck so far. Here is my code:
library(rvest)
link <- "https://exploreuk.uky.edu/catalog/xt7t1g0hx952#page/1/mode/1up"
page <- read_html(link)
text <- page %>% html_elements("#text_frame") %>% html_text()
I used SelectorGadget to select the text, but I only get "" as the output. Can anyone please help me with this problem?
TIA
CodePudding user response:
This is dynamically rendered content: read_html only fetches the static HTML and does not run the page's JavaScript, so the text cannot be scraped with conventional html_elements methods. Here is a way to get the JavaScript-rendered text with RSelenium:
library(wdman)
library(RSelenium)
# Start a local Selenium server
selServ <- selenium(
port = 4444L,
version = 'latest',
chromever = '103.0.5060.134' # set to a chromedriver version available on your machine
)
remDr <- remoteDriver(
remoteServerAddr = 'localhost',
port = 4444L,
browserName = 'chrome'
)
remDr$open()
link <- "https://exploreuk.uky.edu/catalog/xt7t1g0hx952#page/1/mode/1up"
remDr$navigate(link)
# Locate the "text" tab and click it so that the text iframe is loaded
text_button <- remDr$findElement("xpath", "/html/body/div[1]/section[2]/div[2]/div[3]/div[1]/ul/li[2]")
remDr$mouseMoveToLocation(webElement = text_button)
text_button$click()
# The text lives inside an iframe, so switch the driver's focus into it
iframe <- remDr$findElement("xpath", "/html/body/div[1]/section[2]/div[2]/div[3]/div[3]/iframe")
remDr$switchToFrame(iframe)
# Read the <pre> element that holds the text
all_text <- remDr$findElement("xpath", "html/body/pre")
all_text$getElementText()
#First few rows:
# . • . .·» ’ ' -· 7.\n4 4.,.- —— ..\"..`....,...;’:\"·.·»~——··_,__ ·\n4 . ..,....,·;;`,».—g;C3,:_·:r:;,;;;;;~...~.`.e;:~.g\n. ,`,:......... · >¢# fF$?Z;;r;:: wi Ti - g; r;r;:::.\"‘·;_~··· : \nmw.- ‘ ..-- :1g?;;::::\".-gg-_;c::§;§;;;z;r:r:9i;fi1:;:;::§§;‘};·—-..-2..•··\n : . -- -— ~.. j‘_‘&:&'i\"'”\"r:.x:‘:. r»=r§i?:`k~<·`¥,¥?£21iT3:!2§§3:::::&_—z.;:r§;§55:€iZ;;;;?£é§5;i—.;.·;,;5;>E;;;;>§;:$*/mx\n- · ‘ ··=:¢ J ¤· T7’¥:—r:<:7Z`·€:`???:1<¤`*?Z2€f!f¤T*§::1:1¤1‘?:1:=:fFi*’£:<:f¤fEE·:;:¢F§?é:=:: .-==;;$?k::¢¢¢§;>.;- T¥7;`?§ ;§~·~,;;:;:>;5;;;:;.·:3=¤‘7.;; 1;:::\n.·g · ;;;;;::i-··-~·:;::.(·7·~·;;::--—·;1;::r:··~-1;:.·»77~¢:;:1-·—;;::»·—·gn.-r·~;;;;:,~;·;:;:..·;;;:::-qv;·;r.·:-.-»·-;;;:;....—-··
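The output above is one long string with embedded "\n" characters. As far as I know, getElementText() returns a list holding that single string, so here is a minimal sketch of splitting it into lines. The raw_text value below is a made-up stand-in for the real result, not actual output from the page:

```r
# Stand-in for the value returned by all_text$getElementText();
# assumed here to be a list holding one character string.
raw_text <- list("first line\nsecond line\n\nthird line")

# Flatten the list, then split on newlines to get one element per line.
text_lines <- strsplit(unlist(raw_text), "\n", fixed = TRUE)[[1]]

# Trim surrounding whitespace and drop empty lines.
text_lines <- trimws(text_lines)
text_lines <- text_lines[nzchar(text_lines)]

print(text_lines)
```

In the real session you would replace the stand-in with raw_text <- all_text$getElementText().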
You might have to do a little extra work setting up RSelenium, such as installing a driver etc. Let me know if this works!
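Once the text has been read, the driver is still focused inside the iframe and the server is still running. A rough sketch of cleaning up, assuming the remDr and selServ objects from the code above; release_session is a hypothetical helper name, not part of RSelenium:

```r
# Hypothetical helper (not part of RSelenium): leaves the iframe, then
# shuts down the browser session and the Selenium server.
release_session <- function(remDr, selServ) {
  remDr$switchToFrame(NULL)  # NULL switches focus back to the page's default content
  remDr$close()              # close the browser session
  selServ$stop()             # stop the server started by selenium()
  invisible(TRUE)
}
```

You would call it as release_session(remDr, selServ) after extracting the text. If you want to keep scraping other parts of the page, call only remDr$switchToFrame(NULL) and leave the session open.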
Here is a post which describes some of the switching to default content frame logic: