I'm web-scraping with the R package rvest. My code runs without an error, but the result it stores in my environment is an empty character vector.
My code:
amore_tomato_page <- "https://thrivemarket.com/p/amore-tomato-paste"
amore_tomato <- read_html(amore_tomato_page)
amore_tomato_body <- amore_tomato %>%
  html_node("body") %>%
  html_children()
allergens <- amore_tomato %>%
  html_nodes(xpath = '/html/body/div[1]/div[2]/div[4]/div[6]/div/div[1]/section/div/div/div/div/div[2]/div[2]/p[2]') %>%
  html_attr()
ingredients <- amore_tomato %>%
  html_nodes(xpath = '/html/body/div[1]/div[2]/div[4]/div[6]/div/div[1]/section/div/div/div/div/div[2]/div[1]/p') %>%
  html_attr()
I'm trying to extract the allergen information and ingredients for this product (and eventually for hundreds of other products).
Thank you in advance for your help troubleshooting this!
Best,
~Mayra
CodePudding user response:
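That data is loaded dynamically from a JSON string stored in a script tag (id __NEXT_DATA__), so the nodes your XPath expressions point to are probably not in the HTML that read_html() downloads. Note also that html_attr() reads an attribute value and expects an attribute name; html_text() is the function for a node's text. You can extract the JSON string, deserialize it into an R list with jsonlite, and pull out the fields of interest: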
library(tidyverse)
library(rvest)
library(jsonlite)
amore_tomato_page <- "https://thrivemarket.com/p/amore-tomato-paste"
amore_tomato <- read_html(amore_tomato_page)
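# Pull the JSON string out of the __NEXT_DATA__ script tag and parse it into an R list
data <- amore_tomato %>%
  html_element('#__NEXT_DATA__') %>%
  html_text() %>%
  jsonlite::parse_json(simplifyVector = TRUE)
# nutrition_info is a data frame; keep the row that holds the allergen warning
allergy_info <- filter(data$props$pageProps$product$nutrition_info, friendly_label == 'Warning / Allergen Information')$value
ingredients <- data$props$pageProps$product$ingredients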
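Since you mention hundreds of products, the same extraction could be wrapped in a function and applied to a vector of URLs. The following is only a sketch: extract_product_info() and scrape_products() are hypothetical helper names, and it assumes every product page stores its data under the same props$pageProps$product path as the page above, which I have not checked. Fields that are missing come back as NA instead of stopping the loop.

```r
library(rvest)
library(jsonlite)

# Hypothetical helper: takes one parsed page (the result of read_html())
# and returns a one-row data frame with allergen and ingredient text.
extract_product_info <- function(page) {
  node <- html_element(page, "#__NEXT_DATA__")
  # html_element() returns an "xml_missing" object when nothing matches
  if (inherits(node, "xml_missing")) {
    return(data.frame(allergens = NA_character_, ingredients = NA_character_))
  }
  data <- parse_json(html_text(node), simplifyVector = TRUE)
  product <- data$props$pageProps$product  # assumed path, taken from the page above
  info <- product$nutrition_info
  allergens <- info$value[info$friendly_label == "Warning / Allergen Information"]
  ingredients <- product$ingredients
  data.frame(
    allergens = if (length(allergens) > 0) allergens[1] else NA_character_,
    ingredients = if (length(ingredients) > 0) paste(ingredients, collapse = ", ") else NA_character_
  )
}

# Hypothetical wrapper: download each URL, pausing between requests,
# and stack the one-row results into a single data frame.
scrape_products <- function(urls, delay = 1) {
  rows <- lapply(urls, function(url) {
    Sys.sleep(delay)  # pause so the server is not hit with rapid requests
    data.frame(url = url, extract_product_info(read_html(url)))
  })
  do.call(rbind, rows)
}
```

You would then call something like scrape_products(c("https://thrivemarket.com/p/amore-tomato-paste")) with your full list of product URLs. A URL that fails to download will still stop the loop with an error, so for a long list you may want to wrap the read_html() call in tryCatch().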