I am trying to scrape the list of the most famous brands published by YouGov, using the rvest package and the SelectorGadget tool. SelectorGadget works fine, but R only scrapes the first 20 brand names, even though SelectorGadget highlights all brands correctly.
The R code I am using is the following:
# Packages
library("rvest")
library("dplyr")
# Scraping yougov-Data
yougov <- read_html("https://today.yougov.com/ratings/consumer/fame/brands/all")
yougov %>%
  html_nodes("span:nth-child(3)") %>%
  html_text()
I guess the problem is that YouGov shows only the first 20 brands by default. However, the selector suggested by SelectorGadget does not change when you expand the rest of the list.
Thank you very much for your help!
CodePudding user response:
Open the Developer Tools in your browser, switch to the Network tab, and load the website again. You'll notice that the first 20 brands arrive in the initial HTML response. When you press the "Load more" button, a new request is sent that loads 20 more brands. From then on, each time you scroll down the page, further requests are sent, loading more and more brands.
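You cannot replicate this behaviour with rvest's read_html(), which only downloads and parses that first HTML response and does not run the page's JavaScript, unless the website provides an API and you replace the website's URL with the API endpoint to get all this data.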
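If such an endpoint turns out to exist, the paging logic could look roughly like the sketch below. Everything here is illustrative: `fetch_page` is a hypothetical function you would write yourself (for example around `jsonlite::fromJSON()` and whatever request URL shows up in the Network tab), and `fake_fetch` is only a stand-in so the loop can be run without any real endpoint. The page size of 20 is taken from the behaviour described above.

```r
# Collect results page by page until a page comes back empty.
# `fetch_page` is a hypothetical, user-supplied function: given a page
# number, it returns a character vector of brand names for that page.
fetch_all_brands <- function(fetch_page, max_pages = 100) {
  brands <- character(0)
  for (page in seq_len(max_pages)) {
    batch <- fetch_page(page)
    if (length(batch) == 0) break   # an empty page means there is no more data
    brands <- c(brands, batch)
  }
  brands
}

# Stand-in for a real endpoint, used only to show the call pattern:
# it serves 45 made-up names in pages of 20.
fake_names <- paste0("brand_", 1:45)
fake_fetch <- function(page, page_size = 20) {
  start <- (page - 1) * page_size + 1
  if (start > length(fake_names)) return(character(0))
  fake_names[start:min(start + page_size - 1, length(fake_names))]
}

all_brands <- fetch_all_brands(fake_fetch)
length(all_brands)
```

The `max_pages` cap is there so that a misbehaving endpoint cannot keep the loop running forever.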
If you want to continue with the web scraping option, you should consider RSelenium (or any other R library that automates a browser) and perform the following steps:
- send the initial GET request to the website URL
- click on the "Load More" button
- scroll down as many times as you need (the list goes on for over 700 brands)
- get the data
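The steps above could be sketched with RSelenium roughly as follows. This is a minimal sketch, not a tested scraper for this site: the XPath used to find the "Load more" button, the choice of Firefox, the number of scrolls, and the pause length are all assumptions that you would need to check against the live page. It also assumes RSelenium is installed and a matching browser and driver are available locally. The selector is the one from the question.

```r
library(rvest)

# Pure helper: extract the brand names from a page's HTML source,
# using the selector from the question.
extract_brands <- function(page_source) {
  read_html(page_source) %>%
    html_nodes("span:nth-child(3)") %>%
    html_text()
}

# n_scrolls = 40 is a guess: roughly 700 brands at 20 per request.
scrape_all_brands <- function(url, n_scrolls = 40, pause = 1) {
  # Start a browser session (needs a local browser and matching driver).
  driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  remDr <- driver$client
  on.exit({
    remDr$close()
    driver$server$stop()
  }, add = TRUE)

  # 1. Send the initial GET request to the website URL.
  remDr$navigate(url)
  Sys.sleep(pause)

  # 2. Click the "Load more" button.
  # ASSUMPTION: the button can be located by its visible text.
  button <- remDr$findElement(
    using = "xpath",
    value = "//button[contains(., 'Load more')]"
  )
  button$clickElement()

  # 3. Scroll to the bottom repeatedly so further batches are requested.
  for (i in seq_len(n_scrolls)) {
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
    Sys.sleep(pause)  # give the new batch time to load
  }

  # 4. Get the data from the fully loaded page.
  extract_brands(remDr$getPageSource()[[1]])
}

# Example call (not run here, since it opens a real browser):
# brands <- scrape_all_brands("https://today.yougov.com/ratings/consumer/fame/brands/all")
```

Keeping the parsing in `extract_brands()` separate from the browser automation means you can check the selector against saved HTML without starting a browser.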