Extract text from body of HTML page with RSelenium

Time:10-25

I need to extract the text from a bunch of web pages that use JavaScript to render.

The code below usually works for me and returns plain text with line breaks, which is fine.

However, on some pages it doesn't work.

How can I use RSelenium to extract the text of the body of the "URL Fails" indicated webpage?

library("tidyverse")
library("rvest")
library("RSelenium")

remDr <- remoteDriver(port = 4445L)
remDr$open()

# URL Works
url <- "https://www.td.com/ca/en/personal-banking/products/credit-cards/travel-rewards/rewards-visa-card/"

# URL Fails
# url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"

remDr$navigate(url)

pg <-  
  remDr$getPageSource()[[1]] %>% 
  read_html(encoding = "UTF-8") %>% 
  html_node(xpath = "//body") %>%
  as.character() %>% 
  htm2txt::htm2txt()

remDr$close()

Proposed Solution by @NadPat

url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()

Result for me:

Selenium message:a is null
Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
System info: host: 'fe72a1de69e7', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-84-generic', java.version: '1.8.0_91'
Driver info: driver.version: unknown

Error:   Summary: UnknownError
     Detail: An unknown server-side error occurred while processing the command.
     class: org.openqa.selenium.WebDriverException
     Further Details: run errorDetails method

For the failing URL, something is being read, because remDr$getPageSource()[[1]] returns:

[1] "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><script>\n\nsitePrefix = 'BMO';\nvar pageNameMapping = {};\n\n//channelDemo\npageNameMapping[\"atm_en\"]=\"channelDemo\";\npageNameMapping[\"atm_fr\"]=\"channelDemo\";\n\n//Every Day Banking\npageNameMapping[\"Personal\"]=\"PERS\";\npageNameMapping[\"Bank Accounts\"]=\"Bank-Accounts\";\npageNameMapping[\"Daily savings account\"]=\"Premium-Rate-Savings\";\npageNameMapping[\"High Interest Savings Account\"]=\"Smart-Saver\";\npageNameMapping[\"Chequing account\"]=\"Primary-Chequing\";\npageNameMapping[\"Business Premium Rate Savings\"]=\"Business Premium Rate Account\";\n\n//Cards\npageNameMapping[\"Credit Cards\"]=\"CC\";\n\n\n//Mortgages\npageNameMapping[\"Mortgages\"]=\"MTG\";\npageNameMapping[\"Special Offers\"]=\"Special-Offers\";\n\n//Wealth Management\npageNameMapping[\"Wealth Management\"]=\"Wealth\";\npageNameMapping[\"AdviceDirect\"]=\"Advicedirect\";\n\n//Online Investing\npageNameMapping[\"Online Investing\"]=\"ONL-INVS\";\npageNameMapping...
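Since getPageSource() does return markup for the failing URL, one option is to extract the text from that string without htm2txt. The sketch below is only an illustration: page_source_to_text is a hypothetical helper name, and it assumes rvest 1.0.0 or later (for html_element() and html_text2()) plus xml2. It drops script, style and noscript nodes first so that inline JavaScript like the pageNameMapping block above does not end up in the result.

```r
library(rvest)
library(xml2)

# Hypothetical helper: turn a page-source string into plain body text.
page_source_to_text <- function(page_source) {
  doc <- read_html(page_source, encoding = "UTF-8")

  # Remove nodes whose content is code or styling rather than visible text.
  xml_remove(xml_find_all(doc, "//script | //style | //noscript"))

  # html_text2() approximates browser rendering: block elements and <br>
  # become line breaks, and runs of whitespace are collapsed.
  html_text2(html_element(doc, "body"))
}

# Usage with a live session (assumes remDr is already open and navigated):
# pg <- page_source_to_text(remDr$getPageSource()[[1]])
```

This only sees whatever markup exists at the moment getPageSource() is called, so content that JavaScript injects later would still be missing.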

Is there something wrong with how I have set up RSelenium with Docker?

=======================

UPDATE: I pulled the latest version of standalone-firefox from Docker, and now @NadPat's solutions work for me.

docker pull selenium/standalone-firefox:latest

Answer:

Launch the browser:

library(RSelenium)
driver <- rsDriver(
  port = 4841L,
  browser = "firefox"
)

remDr <- driver[["client"]]

url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"

First method: take the text of the whole document.

remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()
[[1]]
[1] "Skip navigation\nPersonal\nPrivate Wealth\nBusiness\nCommercial\nCapital Markets\nSearch\nFind us\nSupport\nEN\nLogin\nBank Accounts\nCredit Cards\nMortgages\nLoans & Lines of Credit\nInvestments\nFinancial Planning\nInsurance\nWays to Bank\nAbout BMO\nPersonal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional groce

Second method: take only the text of the element with id "main".

text <- remDr$findElement(using = 'xpath', value = '//*[@id="main"]')
text$getElementText()
[[1]]
[1] "Personal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional grocery rewards earn rate on cash back credit cards with no annual fee as of June 1, 2021.\nWelcome offer\nGet up to 5% cash back in your first 3 months‡‡ and a 1.99% introductory interest rate on balance transfers for 9 months with a 1% transfer fee.§§\nAPPL
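Because these pages are rendered by JavaScript, the element may not exist, or may still be empty, immediately after navigate() returns. The following is a minimal sketch of a polling wrapper around the two calls used above (findElement() and getElementText()). The name get_rendered_text and the timeout and interval defaults are assumptions made for illustration, not part of RSelenium.

```r
# Hypothetical helper: retry until the XPath matches an element with
# non-empty text, or give up after `timeout` seconds.
get_rendered_text <- function(remDr, xpath = "/html",
                              timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    text <- tryCatch(
      {
        element <- remDr$findElement(using = "xpath", value = xpath)
        element$getElementText()[[1]]
      },
      # findElement() raises an R error when nothing matches yet;
      # treat that as "no text so far" and try again.
      error = function(e) ""
    )
    if (is.character(text) && length(text) == 1 && nzchar(text)) {
      return(text)
    }
    if (Sys.time() >= deadline) {
      stop("No text found for xpath '", xpath, "' within ", timeout, " seconds")
    }
    Sys.sleep(interval)
  }
}

# Usage (assumes remDr is an open session, as created above):
# remDr$navigate(url)
# main_text <- get_rendered_text(remDr, xpath = '//*[@id="main"]')
```

Polling a specific container such as "main" tends to be a better readiness signal than "/html", which is present as soon as the document exists.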