How can I extract data using xpath and the rvest function html_nodes()?

Time:06-05

I'm web-scraping with the R package rvest. My code runs without an error, but the objects it creates in the environment are empty character vectors.

My code:

amore_tomato_page <- "https://thrivemarket.com/p/amore-tomato-paste"
amore_tomato <- read_html(amore_tomato_page)
amore_tomato_body <- amore_tomato %>%
  html_node("body") %>%
  html_children()

allergens <- amore_tomato %>%
  html_nodes(xpath = '/html/body/div[1]/div[2]/div[4]/div[6]/div/div[1]/section/div/div/div/div/div[2]/div[2]/p[2]') %>%
  html_attr()

ingredients <- amore_tomato %>%
  html_nodes(xpath = '/html/body/div[1]/div[2]/div[4]/div[6]/div/div[1]/section/div/div/div/div/div[2]/div[1]/p') %>%
  html_attr()

I'm trying to extract the allergen information and ingredients for this product, and eventually for hundreds of other products.

Thank you in advance for your help troubleshooting this!

Best,
~Mayra

Answer:

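That data is loaded dynamically: it is not in the HTML that read_html() downloads at those XPath locations, but in a script tag (id __NEXT_DATA__) containing a JSON string, so an XPath copied from the browser's rendered page matches nothing. Note also that html_attr() reads a named attribute; html_text() is the function for an element's text. You can extract the JSON string, deserialize it with jsonlite, and parse out the info of interest: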

library(tidyverse)
library(rvest)
library(jsonlite)

amore_tomato_page <- "https://thrivemarket.com/p/amore-tomato-paste"
amore_tomato <- read_html(amore_tomato_page)
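# Pull the JSON string out of the script tag and convert it to R lists/data frames
data <- amore_tomato %>%
  html_element('#__NEXT_DATA__') %>%
  html_text() %>%
  jsonlite::parse_json(simplifyVector = TRUE)

# nutrition_info is a data frame; keep the row holding the allergen text
allergy_info <- filter(data$props$pageProps$product$nutrition_info, friendly_label == 'Warning / Allergen Information')$value
ingredients <- data$props$pageProps$product$ingredients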
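
Since the goal is to repeat this for hundreds of products, the steps above could be wrapped in a function. The following is only a sketch: extract_product_info() is a hypothetical helper name, the HTML string is a made-up fixture that mimics the structure used above (it is not real page content), and it assumes every product page uses the same JSON layout, which may not hold.

```r
library(rvest)
library(jsonlite)

# Hypothetical helper (not part of rvest): takes a page already parsed with
# read_html() and returns the ingredients and allergen text, or NA if missing.
extract_product_info <- function(page) {
  nodes <- html_elements(page, "#__NEXT_DATA__")
  if (length(nodes) == 0) {
    # No embedded JSON found; return NAs so one bad page does not stop a batch
    return(list(ingredients = NA_character_, allergens = NA_character_))
  }
  data <- parse_json(html_text(nodes[[1]]), simplifyVector = TRUE)
  product <- data$props$pageProps$product
  info <- product$nutrition_info
  allergens <- info$value[info$friendly_label == "Warning / Allergen Information"]
  list(
    ingredients = if (is.null(product$ingredients)) NA_character_ else product$ingredients,
    allergens = if (length(allergens) == 0) NA_character_ else allergens
  )
}

# Made-up fixture for illustration only; it copies the JSON paths used above
sample_html <- paste0(
  '<html><body><script id="__NEXT_DATA__" type="application/json">',
  '{"props":{"pageProps":{"product":{"ingredients":"Tomatoes, salt",',
  '"nutrition_info":[',
  '{"friendly_label":"Warning / Allergen Information","value":"None"},',
  '{"friendly_label":"Serving Size","value":"2 tsp"}]}}}}',
  '</script></body></html>'
)
result <- extract_product_info(read_html(sample_html))

# With real pages (urls is an assumed character vector of product URLs):
# results <- lapply(urls, function(u) {
#   Sys.sleep(1)  # pause between requests to be polite to the server
#   extract_product_info(read_html(u))
# })
```

Returning NA instead of stopping on a missing field keeps a long batch running; the NA rows can be inspected afterwards.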