webscraping an embedded table with rvest

Time:05-22

I am learning how to web scrape with rvest in R, and I want to extract the table embedded in the website below:

https://perfectunion.us/map-where-are-starbucks-workers-unionizing/

If you scroll about halfway down the page, you will see an embedded table of Starbucks stores and their unionization status.

When I use the CSS selector tool and highlight the table body, it gives me the selector "td".

However, when I run the rvest code below, I get:

{xml_nodeset (0)}

I have also used the browser's inspect feature to find the table's full selector (below), and I get the same empty result.

"table#wpgmza_table_1.responsive.wpgmza_table.dataTable.no-footer.dtr-inline.collapsed"

Can anyone help me extract that table into R? I am working on a science practice project.

pacman::p_load(tidyverse,rvest)

url <- "https://perfectunion.us/map-where-are-starbucks-workers-unionizing/"

sb <- rvest::read_html(url)

# Method 1: select every table cell
sb %>%
  rvest::html_elements("td")

# Method 2: select the table by its id and classes
sb %>%
  rvest::html_elements("table#wpgmza_table_1.responsive.wpgmza_table.dataTable.no-footer.dtr-inline.collapsed")


I would appreciate any help extracting that table from the website and bringing it into R as a table.

CodePudding user response:

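It looks like the table is not part of the page's HTML: it is filled in by JavaScript from a JSON endpoint, which would explain why rvest finds no td elements in the static page. You can find the link in the Network tab of your browser's developer tools.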

url<-"https://perfectunion.us/wp-json/wpgmza/v1/datatables/base64eJy10zFrwzAQBeD-8mYV6rZJQFvo0CWBDIFC4lKu1sUWlRVzkkPA L9HcVLo1qVa795907sBXdO9OgoBGu bt-VuWZZrkm WlQ3R rosl ZEvmKzpS-HUAiRJEI-Kjj2dWygHwqFlrpPa5JSpEh1dH3rk7kfYCjSlPbUctpfBSapmonTUXpWOIph T24RaAHnMj19zvhms-QB3KBx1H92EVG ymj-ZzRfslozzLa84z24v-tj-vZ1PRb66euG5tGoFDhGvkbUjhYF1nSw21IqE2vM4zjBWiiMh0"

jsonlite::fromJSON(url)
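
To get from the parsed JSON to a table, something like the sketch below might work. I have not confirmed the exact shape of the response, so it assumes the rows sit under a `data` element as an array of arrays (a common DataTables layout). `sample_json`, its values, and the column names are made-up stand-ins so the snippet runs offline.

```r
library(jsonlite)

# Made-up stand-in for the endpoint's response. The `data` field and the
# values in it are assumptions, not the real payload.
sample_json <- '{
  "recordsTotal": 2,
  "data": [
    ["Store A", "Buffalo, NY", "Won"],
    ["Store B", "Mesa, AZ", "Filed"]
  ]
}'

res <- fromJSON(sample_json)

# An array of equal-length string arrays is simplified to a character matrix
rows <- res$data

# Hypothetical column names; replace them with whatever the real table uses
colnames(rows) <- c("store", "location", "status")

# Convert the matrix to a data frame, keeping the strings as character columns
stores <- as.data.frame(rows, stringsAsFactors = FALSE)
stores
```

With the real endpoint you would replace `sample_json` with `url` and inspect `str(res)` first to see where the rows actually live.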

I'm not sure how stable this link is; it may change regularly.
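
Because the link may stop working, it can help to wrap the request so that a dead or changed link gives a clear warning instead of stopping the script. This is only a minimal sketch: `fetch_table_json` is a hypothetical helper name, and the inline strings at the bottom stand in for the real URL so the example runs offline.

```r
library(jsonlite)

# Hypothetical helper: returns the parsed JSON, or NULL (with a warning)
# if the download or the parsing fails.
fetch_table_json <- function(src) {
  tryCatch(
    fromJSON(src),
    error = function(e) {
      warning("Could not read JSON from source: ", conditionMessage(e),
              call. = FALSE)
      NULL
    }
  )
}

# Stand-ins for a working and a broken response
ok  <- fetch_table_json('{"data": [[1, 2], [3, 4]]}')
bad <- suppressWarnings(fetch_table_json("{not valid json"))

is.null(ok)   # FALSE
is.null(bad)  # TRUE
```

In a real script you would call `fetch_table_json(url)` and check for `NULL` before going any further.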
