Web scraping with RSelenium findElement

Time:07-13

I feel this should be simple, but I have been struggling to get it right. I'm trying to extract the employee count ("2,300,000") from this webpage: https://fortune.com/company/walmart/

I used the Chrome extension SelectorGadget to locate the number: "info__row--7f9lE:nth-child(13) .info__value--2AHH7"

library(RSelenium)
library(rvest)
library(netstat)

rs_driver_object <- rsDriver(browser = 'chrome', chromever = '103.0.5060.53', verbose = FALSE, port = free_port())
remDr <- rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
Employees <- remDr$findElement(using = 'xpath', '//h3[@]')
Employees

An error says

"Selenium message:no such element: Unable to locate element".

I have also tried:

Employees <- remDr$findElement(using = 'class name', 'info__value--2AHH7')

But it does not return the data in the form I want.

Can someone point out the problem? Really appreciate it!

CodePudding user response:

Does it have to be RSelenium only? In my experience, the most flexible approach is to use RSelenium to navigate to the required pages (where findElement helps you find boxes to type into or buttons to click) and then use rvest to extract what you need from the page source.

Start with

rs_driver_object <- rsDriver(browser = 'chrome', chromever = '103.0.5060.53', verbose = FALSE, port = netstat::free_port())
remDr <- rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
page_source <- remDr$getPageSource()
pg <- xml2::read_html(page_source[[1]])

How you then go about it depends on how specific you want the solution to be with respect to this exact page. Here is one way:

rvest::html_elements(pg, "div.info__row--7f9lE") |> 
  rvest::html_text2()

or

rvest::html_elements(pg, "div:nth-child(13) > div.info__value--2AHH7") |> 
  rvest::html_text2()

or

rvest::html_elements(pg, "div.info__row--7f9lE")[11] |> 
  rvest::html_children()

or

rvest::html_elements(pg, '.info__row--7f9lE:nth-child(13) .info__value--2AHH7') |> 
  rvest::html_text2()

et cetera. What you do in the rvest part would depend on how general you want the selection/extraction process to be.
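If you want something less tied to this page's hashed class names, the label-based lookup can be sketched on a minimal stand-in document (the markup below is hypothetical, mirroring the row structure the selectors above target):

```r
library(rvest)

# Minimal stand-in for the page's label/value rows (hypothetical markup)
html <- minimal_html('
  <div class="info__row--7f9lE">
    <div class="info__label">Employees</div>
    <div class="info__value--2AHH7">2,300,000</div>
  </div>')

# Look the value up by its visible label rather than by class name,
# so the code survives changes to the hashed class suffixes
employees <- html |>
  html_elements(xpath = "//div[text()='Employees']/following-sibling::div") |>
  html_text2()
employees
#> [1] "2,300,000"
```

The same XPath works unchanged on the full `pg` object parsed above.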

CodePudding user response:

Use RSelenium to load the webpage and get the page source:

remDr$navigate(url = 'https://fortune.com/company/walmart/')
pgSrc <- remDr$getPageSource()

Use rvest to read the contents of the webpage:

pgCnt <- read_html(pgSrc[[1]])

Then use the rvest::html_nodes() and rvest::html_text() functions to extract the text with an appropriate XPath selector (a Chrome extension such as SelectorGadget can help).

reqTxt <- pgCnt %>%
  html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text(trim = TRUE)

Output of reqTxt

> reqTxt
[1] "2,300,000"