web scraping in r with SelectorGadget

Time:07-12

I ran the simple code below to scrape the number of employees from this Fortune 500 page. I used the Chrome extension SelectorGadget to find that the number I want matches the selector ".info__row--7f9lE:nth-child(13) .info__value--2AHH7".

library(rvest)
library(dplyr)
# Download the Google Chrome extension SelectorGadget to find the selector
link = "https://fortune.com/company/walmart/"
page = read_html(link)
Employees = page %>% html_nodes(".info__row--7f9lE:nth-child(13) .info__value--2AHH7") %>% html_text()
Employees

However, it returned "character(0)". Does anyone know what the cause is? I feel it must be a simple mistake somewhere. Thanks in advance!

CodePudding user response:

Running document.querySelectorAll(".info__row--7f9lE:nth-child(13) .info__value--2AHH7") in the browser console shows that you want to scrape the number of employees. Maurits is right: it looks like the data is downloaded as inline JSON and rendered later by JavaScript, so the element your selector targets is not in the HTML that read_html() receives. You can use Selenium to save the rendered page and then apply your CSS selector to it, or you can extract the inline JSON and read the value from there.
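
To illustrate why the selector comes back empty, here is a minimal sketch on a made-up HTML string rather than the real Fortune page. The markup, the ".info__value" class, and the one-field JSON are assumptions chosen for illustration; only the script id "preload" and the window.__PRELOADED_STATE__ prefix are taken from the answer's code.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for a page that renders its content with JavaScript:
# the visible <div> is empty in the raw HTML and the data sits in a <script>.
html <- '<html><body>
  <div id="app"></div>
  <script id="preload">window.__PRELOADED_STATE__ = {"employees": "2300000"};</script>
</body></html>'

page <- read_html(html)

# The selector targets markup that would only exist after JavaScript has run,
# so nothing matches in the static HTML.
rendered <- page |> html_elements(".info__value") |> html_text()
rendered

# The script body is plain text in the static HTML, so it can be cleaned and parsed.
raw <- page |> html_element("script#preload") |> html_text()
json <- sub("^\\s*window\\.__PRELOADED_STATE__ = ", "", raw, perl = TRUE)
json <- sub(";\\s*$", "", json, perl = TRUE)
state <- fromJSON(json, simplifyVector = FALSE)
state$employees
```

rvest parses the HTML it is given but does not execute scripts, which is why the first query finds nothing while the second can still read the embedded data.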

After some manual work, you can do the second option as shown below. The code runs on R 4.1.x or later; the commented-out lines show the equivalent syntax for R 4.2.x.

library(rvest)
library(jsonlite)

# R 4.1.x: helper that takes x first, so sub() can be used in a |> pipeline
sub2 <- function(x, pattern, replacement) sub(pattern, replacement, x = x, perl = TRUE)

url <- "https://fortune.com/company/walmart/"
json_data <- read_html(url) |>
  html_element("script#preload") |> 
  html_text() |>
  ## sub("\\s*window\\.__PRELOADED_STATE__ = ", "", x = _, perl = TRUE) |> # R 4.2.x
  sub2("\\s*window\\.__PRELOADED_STATE__ = ", "") |>                       # R 4.1.x
  ## sub(";\\s*$", "", x = _, perl = TRUE) |>  # R 4.2.x
  sub2(";\\s*$", "") |>                        # R 4.1.x
  fromJSON(simplifyVector = FALSE)

page_data <- json_data$components$page[["/company/walmart/"]]

find_by_name <- function(list_data, name, elem = NULL) {
  idx <- which(sapply(list_data, \(x) x$name) == name, arr.ind = TRUE)
  stopifnot(length(idx) > 0)
  if (length(idx) > 1) { idx <- idx[1] }
  dat <- list_data[[idx]]
  if (is.null(elem)) dat else dat[[elem]]
}

info_data <- page_data |> 
  find_by_name("body", "children") |>
  find_by_name("company-about-wrapper", "children") |>
  find_by_name("company-information", "config")

info_data$employees
#> [1] "2300000"
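
As a rough illustration of how the lookup walks the tree, here is a sketch that runs on a small made-up list instead of the real page data. The toy structure is an assumption, and this find_by_name is a variant rather than the function above: it uses vapply() so that children without a name field are skipped, since sapply() returns a list rather than a character vector in that case.

```r
# Variant of find_by_name that tolerates children without a `name` field.
find_by_name <- function(list_data, name, elem = NULL) {
  # Collect each child's name, using NA when the field is absent.
  names_vec <- vapply(
    list_data,
    function(x) if (is.null(x$name)) NA_character_ else as.character(x$name),
    character(1)
  )
  idx <- which(names_vec == name)  # which() drops the NA comparisons
  stopifnot(length(idx) > 0)
  dat <- list_data[[idx[1]]]       # keep the first match only
  if (is.null(elem)) dat else dat[[elem]]
}

# Hypothetical miniature of the page tree; the real structure is larger.
toy <- list(
  list(name = "header", children = list()),
  list(name = "body", children = list(
    list(children = list()),  # a child with no name
    list(name = "company-information", config = list(employees = "2300000"))
  ))
)

info <- toy |>
  find_by_name("body", "children") |>
  find_by_name("company-information", "config")

info$employees
```

Each call picks the first child whose name matches and then descends into the requested element, so chaining the calls follows one path from the root down to the config list.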