I need to extract the text from a number of web pages that render their content with JavaScript.
The code below usually works for me and returns plain text with line breaks, which is fine.
However, on some pages it does not work.
How can I use RSelenium to extract the body text of the web page marked "URL Fails" below?
library("tidyverse")
library("rvest")
library("RSelenium")
remDr <- remoteDriver(port = 4445L)
remDr$open()
# URL Works
url <- "https://www.td.com/ca/en/personal-banking/products/credit-cards/travel-rewards/rewards-visa-card/"
# URL Fails
# url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
pg <-
  remDr$getPageSource()[[1]] %>%
  read_html(encoding = "UTF-8") %>%
  html_node(xpath = "//body") %>%
  as.character() %>%
  htm2txt::htm2txt()
remDr$close()
Proposed Solution by @NadPat
url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()
Result for me:
Selenium message:a is null
Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
System info: host: 'fe72a1de69e7', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-84-generic', java.version: '1.8.0_91'
Driver info: driver.version: unknown
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: org.openqa.selenium.WebDriverException
Further Details: run errorDetails method
For the failing URL, something is being read, because
remDr$getPageSource()[[1]]
returns:
[1] "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><script>\n\nsitePrefix = 'BMO';\nvar pageNameMapping = {};\n\n//channelDemo\npageNameMapping[\"atm_en\"]=\"channelDemo\";\npageNameMapping[\"atm_fr\"]=\"channelDemo\";\n\n//Every Day Banking\npageNameMapping[\"Personal\"]=\"PERS\";\npageNameMapping[\"Bank Accounts\"]=\"Bank-Accounts\";\npageNameMapping[\"Daily savings account\"]=\"Premium-Rate-Savings\";\npageNameMapping[\"High Interest Savings Account\"]=\"Smart-Saver\";\npageNameMapping[\"Chequing account\"]=\"Primary-Chequing\";\npageNameMapping[\"Business Premium Rate Savings\"]=\"Business Premium Rate Account\";\n\n//Cards\npageNameMapping[\"Credit Cards\"]=\"CC\";\n\n\n//Mortgages\npageNameMapping[\"Mortgages\"]=\"MTG\";\npageNameMapping[\"Special Offers\"]=\"Special-Offers\";\n\n//Wealth Management\npageNameMapping[\"Wealth Management\"]=\"Wealth\";\npageNameMapping[\"AdviceDirect\"]=\"Advicedirect\";\n\n//Online Investing\npageNameMapping[\"Online Investing\"]=\"ONL-INVS\";\npageNameMapping...
Is there something wrong with how I have set up RSelenium with Docker?
=======================
UPDATE:
I pulled the latest version of the standalone-firefox image from Docker, and now @NadPat's solutions work for me.
docker pull selenium/standalone-firefox:latest
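For anyone reproducing this against the updated image, here is a minimal sketch of connecting from R and reading the body text. It assumes the container was started with host port 4445 mapped to the server's port 4444 (matching the `port = 4445L` used above). `extract_body_text()` and its `wait_secs` pause are hypothetical names introduced here for illustration, not part of RSelenium, and the fixed pause is only a crude way of letting the JavaScript finish rendering.

```r
# Sketch only: assumes the updated container was started with something like
#   docker run -d -p 4445:4444 selenium/standalone-firefox:latest
# so that host port 4445 (as in the question) maps to the server's 4444.

# extract_body_text() is a hypothetical helper, not an RSelenium function.
extract_body_text <- function(url, port = 4445L, wait_secs = 5) {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = port,
    browserName = "firefox"
  )
  remDr$open(silent = TRUE)
  # Close the session even if a later step throws an error.
  on.exit(remDr$close(), add = TRUE)

  remDr$navigate(url)
  # Crude fixed pause to let JavaScript finish rendering; tune as needed.
  Sys.sleep(wait_secs)

  body <- remDr$findElement(using = "css selector", value = "body")
  # getElementText() returns a list of length one; unwrap it to a string.
  body$getElementText()[[1]]
}

# Usage (needs the running container, so it is left commented out):
# txt <- extract_body_text("https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/")
# cat(txt)
```

Wrapping the session in a function with `on.exit()` is just one way to make sure the browser session is closed when a page fails part-way through.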
Answer:
Launch the browser:
library(RSelenium)
driver <- rsDriver(
  port = 4841L,
  browser = "firefox"
)
remDr <- driver[["client"]]
url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
First method: take the text of the whole document through the root html element.
remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()
[[1]]
[1] "Skip navigation\nPersonal\nPrivate Wealth\nBusiness\nCommercial\nCapital Markets\nSearch\nFind us\nSupport\nEN\nLogin\nBank Accounts\nCredit Cards\nMortgages\nLoans & Lines of Credit\nInvestments\nFinancial Planning\nInsurance\nWays to Bank\nAbout BMO\nPersonal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional groce
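Second method: restrict the extraction to the element with id "main", which in the output below leaves out the site navigation.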
text <- remDr$findElement(using = 'xpath', value = '//*[@id="main"]')
text$getElementText()
[[1]]
[1] "Personal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional grocery rewards earn rate on cash back credit cards with no annual fee as of June 1, 2021.\nWelcome offer\nGet up to 5% cash back in your first 3 months‡‡ and a 1.99% introductory interest rate on balance transfers for 9 months with a 1% transfer fee.§§\nAPPL
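Both methods return a list holding one long string with embedded `\n` characters. If you want one entry per line, as in the original `htm2txt` approach, a small helper along these lines may be enough. This is only a sketch: `tidy_element_text()` is a hypothetical name, and the `example` value is a made-up stand-in built from the first few lines of the output above, with an extra blank line added to show the filtering.

```r
# tidy_element_text() is a hypothetical helper name, not an RSelenium function.
tidy_element_text <- function(x) {
  # getElementText() returns a list; flatten it to a character vector.
  txt <- as.character(unlist(x, use.names = FALSE))
  # Split on line breaks, trim surrounding whitespace, drop empty lines.
  lines <- trimws(unlist(strsplit(txt, "\n", fixed = TRUE)))
  lines[nzchar(lines)]
}

# Stand-in for text$getElementText(); the blank line is added for illustration.
example <- list("Personal\nCredit Cards\n\nBMO CashBack Mastercard")
tidy_element_text(example)
```

With a live session you would call `tidy_element_text(text$getElementText())` instead of passing `example`.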