Cannot Web Scrape Text Box with R Studio Using Rvest

Time:08-25

I am trying to scrape the text box at the bottom of this page, under the "text" tab. I have spent a long time trying to figure out how to do this, but have had no luck so far. Here is my code:

library(rvest)

link <- "https://exploreuk.uky.edu/catalog/xt7t1g0hx952#page/1/mode/1up"
page <- read_html(link)
text <- page %>% html_elements("#text_frame") %>% html_text()

I used SelectorGadget to pick the selector for the text, but I only get "" as the output. Can anyone please help me with this problem?

TIA

CodePudding user response:

This content is rendered dynamically: the text is loaded into an iframe by JavaScript after the page loads, so read_html() never sees it and html_elements() has nothing to extract. Here is a way to get the text with RSelenium, which drives a real browser:

library(wdman)
library(RSelenium)

selServ <- selenium(
  port = 4444L,
  version = 'latest',
  chromever = '103.0.5060.134' # set to a version available on your machine
)

remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4444L,
  browserName = 'chrome'
)

remDr$open()

link <- "https://exploreuk.uky.edu/catalog/xt7t1g0hx952#page/1/mode/1up"

# Load the page in the browser
remDr$navigate(link)
# Click the "text" tab so the page loads the text frame
text_button <- remDr$findElement("xpath", "/html/body/div[1]/section[2]/div[2]/div[3]/div[1]/ul/li[2]")
remDr$mouseMoveToLocation(webElement = text_button)
text_button$click()
# Switch into the iframe that holds the text
iframe <- remDr$findElement("xpath", "/html/body/div[1]/section[2]/div[2]/div[3]/div[3]/iframe")
remDr$switchToFrame(iframe)
# The text sits in a <pre> element inside the frame's document
all_text <- remDr$findElement("xpath", "html/body/pre")
all_text$getElementText()

# First few lines of the output:

# . • . .·»  ’ ' -· 7.\n4 4.,.- ——  ..\"..`....,...;’:\"·.·»~——··_,__ ·\n4 .  ..,....,·;;`,».—g;C3,:_·:r:;,;;;;;~...~.`.e;:~.g\n. ,`,:.........  · >¢# fF$?Z;;r;:: wi Ti - g; r;r;:::.\"‘·;_~··· : \nmw.- ‘  ..-- :1g?;;::::\".-gg-_;c::§;§;;;z;r:r:9i;fi1:;:;::§§;‘};·—-..-2..•··\n : . -- -—  ~.. j‘_‘&:&'i\"'”\"r:.x:‘:.  r»=r§i?:`k~<·`¥,¥?£21iT3:!2§§3:::::&_—z.;:r§;§55:€iZ;;;;?£é§5;i—.;.·;,;5;>E;;;;>§;:$*/mx\n- · ‘ ··=:¢ J ¤· T7’¥:—r:<:7Z`·€:`???:1<¤`*?Z2€f!f¤T*§::1:1¤1‘?:1:=:fFi*’£:<:f¤fEE·:;:¢F§?é:=::   .-==;;$?k::¢¢¢§;>.;- T¥7;`?§ ;§~·~,;;:;:>;5;;;:;.·:3=¤‘7.;; 1;:::\n.·g · ;;;;;::i-··-~·:;::.(·7·~·;;::--—·;1;::r:··~-1;:.·»77~¢:;:1-·—;;::»·—·gn.-r·~;;;;:,~;·;:;:..·;;;:::-qv;·;r.·:-.-»·-;;;:;....—-··
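getElementText() hands the text back as a list holding one long string with embedded "\n" characters, which is why the output above looks like a single line. Below is a small sketch for splitting it into lines. The helper name split_lines and the demo input are made up for illustration; with the real session you would pass in all_text$getElementText().

```r
# Hypothetical helper: turn the list returned by getElementText()
# into a character vector with one entry per line.
split_lines <- function(element_text) {
  raw <- element_text[[1]]                         # the single long string
  lines <- strsplit(raw, "\n", fixed = TRUE)[[1]]  # split on literal newlines
  trimws(lines)                                    # drop stray leading/trailing spaces
}

# Stand-in for the value returned by all_text$getElementText()
demo <- list("first line\nsecond line\n third line ")
split_lines(demo)
```

Since this page's text is OCR output, expect the resulting lines to still need cleaning before they are useful.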

You might have to do a little extra work to set up RSelenium, such as installing a browser driver. Let me know if this works!
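For what it is worth, the empty string from the original rvest code can be reproduced offline. The sketch below assumes that #text_frame is an iframe element (which is what the RSelenium code above switches into); the markup, the src value, and the example.org base URL are stand-ins I made up, not copied from the real page.

```r
library(rvest)

# Assumed markup: a stand-in for the real page, not copied from it.
page <- minimal_html('<iframe id="text_frame" src="/text/page1"></iframe>')

frame <- html_element(page, "#text_frame")

# The iframe element itself contains no text nodes,
# so html_text() has nothing to return.
frame_text <- html_text(frame)
print(frame_text)

# The src attribute (if present) points at the document that holds the text.
frame_src <- html_attr(frame, "src")
frame_url <- xml2::url_absolute(frame_src, "https://example.org/catalog/item")
print(frame_url)
```

If the real iframe carries a src attribute in the static HTML (I have not checked), calling read_html() on that resolved URL might get the text without a browser. If the src is only set by JavaScript, something like RSelenium is still needed.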

Here is a post that explains some of the logic behind switching into a frame and back to the default content:

What does #document mean?
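On the same topic, here is a rough sketch of leaving the frame and shutting everything down once the text has been read. It assumes remDr and selServ are the objects created in the code above. The helper name finish_session is made up, and passing NULL to switchToFrame() to get back to the default content is how I read the RSelenium documentation, so treat that as an assumption to verify.

```r
# Hypothetical helper: leave the iframe, close the browser, stop the server.
# `remDr` is the remoteDriver and `selServ` the wdman server from above.
finish_session <- function(remDr, selServ) {
  remDr$switchToFrame(NULL)  # NULL should select the page's default content
  remDr$close()              # close the browser session
  selServ$stop()             # stop the Selenium server process
  invisible(TRUE)
}

# Usage, once the text has been extracted:
# finish_session(remDr, selServ)
```

Stopping the server matters mostly when rerunning the script, since a server left running keeps port 4444 occupied.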
