Use a loop/automation for HTML web scraping


I am performing web scraping in R (using rvest) to collect a specific set of data from various webpages. All of the webpages are formatted the same, so I can extract the targeted data from its placement on each page, using the correct node, without a problem. However, there are 100 different webpages, all with the same URL except for the very end. Is there a way to use a loop to perform the process automatically?

I am using the following code:

webpage_urls <- paste0("https://exampleurl=", endings)

where endings is a vector of the 100 endings that give the separate webpages.
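
For illustration, with a made-up endings vector (the real endings are specific to my pages), the resulting URLs look like this:

endings <- c("page1", "page2", "page3")  # hypothetical endings
webpage_urls <- paste0("https://exampleurl=", endings)
webpage_urls
# [1] "https://exampleurl=page1" "https://exampleurl=page2" "https://exampleurl=page3"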

and then

htmltemplate <- read_html(webpage_urls)

However, I then receive: Error: `x` must be a string of length 1
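
Reading a single page works fine, which suggests read_html() only accepts one URL at a time, not a whole vector:

htmltemplate <- read_html(webpage_urls[1])  # a single string, so no error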

After this step, I would like to perform the following extraction:

webscraping <- htmltemplate %>%
  html_nodes("td") %>%
  html_text()

# keep every n-th element of a vector, starting at starting_position
nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}

result <- nth_element(webscraping, 10, 5)
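
For reference, nth_element keeps every n-th entry starting from starting_position; on a toy vector:

nth_element(letters, 2, 3)
# [1] "b" "e" "h" "k" "n" "q" "t" "w" "z"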

The extraction code all works when I run it manually for each individual webpage, but I cannot repeat the process automatically for every webpage.

I am rather unfamiliar with loops/iteration and how to write them. Is there a way to run this extraction process for each webpage and then store the result of each extraction in a separate vector, so that I can compile them into a table? If not a loop, is there another way to automate the process so that I can get past the error demanding a single string?

CodePudding user response:

library(rvest)  # also provides the %>% pipe

# keep every n-th element of a vector, starting at starting_position
nth_element <- function(vector, starting_position, n) {
  vector[seq(starting_position, length(vector), n)]
}

# lapply visits each URL in turn and returns a list with one result per page
allresults <- lapply(webpage_urls, function(oneurl) {
  read_html(oneurl) %>%
    html_nodes("td") %>%
    html_text() %>%
    nth_element(10, 5)
})
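
To compile the results into a table, one option is a long-format data frame; this is a sketch assuming endings is the same vector used to build the URLs, and it works even if the pages yield different numbers of values:

names(allresults) <- endings  # label each page's result by its URL ending
results_table <- data.frame(
  ending = rep(endings, lengths(allresults)),
  value  = unlist(allresults, use.names = FALSE)
)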