I feel this should be simple, but I have been struggling to get it right. I'm trying to extract the number of employees ("2,300,000") from this webpage: https://fortune.com/company/walmart/
I used the Chrome extension SelectorGadget to locate the number: ".info__row--7f9lE:nth-child(13) .info__value--2AHH7"
library(RSelenium)
library(rvest)
library(netstat)
rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
Employees<-remDr$findElement(using = 'xpath','//h3[@]')
Employees
An error says
"Selenium message:no such element: Unable to locate element".
I have also tried:
Employees<-remDr$findElement(using = 'class name','info__value--2AHH7')
But that does not return the data I want.
Can someone point out the problem? Really appreciate it!
CodePudding user response:
Does it have to be with RSelenium only? In my experience, the most flexible approach is to use RSelenium to navigate to the required pages (where findElement helps you find boxes to enter text into or buttons to click) and then use rvest to extract what you need from the page.
Start with
rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=netstat::free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
page_source <- remDr$getPageSource()
pg <- xml2::read_html(page_source[[1]])
How you then go about it depends on how specific you want the solution to be with respect to this exact page. Here is one way:
rvest::html_elements(pg, "div.info__row--7f9lE") |>
rvest::html_text2()
or
rvest::html_elements(pg, "div:nth-child(13) > div.info__value--2AHH7") |>
rvest::html_text2()
or
rvest::html_elements(pg, "div.info__row--7f9lE")[11] |>
rvest::html_children()
or
rvest::html_elements(pg, '.info__row--7f9lE:nth-child(13) .info__value--2AHH7') |>
rvest::html_text2()
et cetera. What you do in the rvest part would depend on how general you want the selection/extraction process to be.
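If the goal is a selection that survives changes to the hashed class suffixes (`--7f9lE`, `--2AHH7`), one option is to match rows on the class prefix and then pick the row by its label. The sketch below runs on a small hand-written snippet, not the live page. The snippet's structure, the `info__title--demo` class and the "Sector" row are assumptions made up for illustration, so check them against the real page source before relying on this.

```r
library(rvest)

# Hypothetical stand-in for the page source (structure assumed, not copied from fortune.com)
pg_demo <- minimal_html('
  <div class="info__row--7f9lE">
    <div class="info__title--demo">Sector</div>
    <div class="info__value--2AHH7">Example sector</div>
  </div>
  <div class="info__row--7f9lE">
    <div class="info__title--demo">Employees</div>
    <div class="info__value--2AHH7">2,300,000</div>
  </div>')

# Match rows by class prefix so the hashed suffix does not matter
rows <- html_elements(pg_demo, "div[class^='info__row']")

# Within each row: the first child div is taken as the label,
# and the div whose class starts with "info__value" as the value
labels <- html_element(rows, xpath = "./div[1]") |> html_text2()
values <- html_element(rows, "[class^='info__value']") |> html_text2()

# Keep only the value whose label is "Employees"
employees <- values[labels == "Employees"]
employees
```

With the real page, `pg_demo` would be replaced by `pg` from above. Selecting by label rather than by `nth-child(13)` means the result should not shift if rows are added or reordered.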
CodePudding user response:
Use RSelenium to load the webpage and get the page source:
# remDr is the client object created with rsDriver() in the question (R is case-sensitive)
remDr$navigate(url = 'https://fortune.com/company/walmart/')
pgSrc <- remDr$getPageSource()
Use rvest to read the contents of the webpage:
pgCnt <- read_html(pgSrc[[1]])
Further, use the rvest::html_nodes and rvest::html_text functions to extract the text using a relevant XPath selector (this Chrome extension should help):
reqTxt <- pgCnt %>%
html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
html_text(trim = TRUE)
Output of reqTxt
> reqTxt
[1] "2,300,000"
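To see why that XPath works without starting a browser, here is a minimal sketch on a hand-written snippet. The snippet (including the "Sector" row) and the names `pgDemo`, `reqDemo` and `reqNum` are hypothetical; only the XPath is taken from above. It finds the div whose text is exactly "Employees" and then steps to the div that follows it at the same level. The last line is an optional extra, in case a number is needed rather than a string.

```r
library(rvest)

# Hypothetical snippet standing in for pgSrc[[1]] (structure assumed)
pgDemo <- minimal_html('
  <div><div>Sector</div><div>Example sector</div></div>
  <div><div>Employees</div><div>2,300,000</div></div>')

# Find the label div by its exact text, then take the sibling div after it
reqDemo <- pgDemo %>%
  html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text(trim = TRUE)

# Optional: strip the thousands separators and convert to a number
reqNum <- as.numeric(gsub(",", "", reqDemo, fixed = TRUE))
```

Because the match is on the label text rather than on a class name or position, it should keep working as long as the label itself stays "Employees".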