Read_html returning “Error in read_xml.raw([...]) : Failed to parse text” while web scraping multipl

I'm trying to scrape the information about the nurse jobs at this link: https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse Sister Matron&logic=OR

I managed to do it for the first page of results. But when I try the same on the remaining few hundred pages, read_html() no longer works.

The first page works perfectly fine:

install.packages("rvest")
install.packages("dplyr")

library(rvest)
library(dplyr)

link = "https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse Sister Matron&logic=OR"
page = read_html(link)

But then for the following code I get the error message: Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html, : Failed to parse text

link = "https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=2"
page = read_html(link)

Could you please tell me what I'm doing wrong when I scrape the second page of results? Thanks

CodePudding user response：

If you want to scrape a few hundred pages with a simple pagination structure (a next page button), you might be better off using something like RSelenium to automate the clicking and scraping. A handy way to get an XPath is Google Chrome -> Inspect -> right-click the element in the code -> Copy XPath; you can do that for the next page button. Earlier reports of this error were caused by encoding problems, but the encoding for this site is UTF-8, and read_html() still fails even when that is specified. This suggests that the site probably relies on JavaScript, which is another reason to try Selenium. Alternatively, if the coding is too difficult, you can use Octoparse, a free web scraping tool that makes pagination loops easy.

CodePudding user response：

You may be able to create a session and then jump from page to page:

library(rvest)

s<- session("https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse Sister Matron&logic=OR")

link = "https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=2"
# jump to the next page within the same session
s <- session_jump_to(s, link)
page = read_html(s)
page %>% html_elements("div.vacancy")

session_history(s)  # display history

This should work, but I have not fully tested it to verify.
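
One possible reading of why the session helps: the page-2 URL carries no search terms, so the server presumably ties it to a search made earlier in the same session. The sketch below only builds the two kinds of URL so that the difference is visible. `build_search_url()` and `build_page_url()` are hypothetical helpers, and percent-encoding the spaces in the keyword is an assumption, not something the code above requires.

```r
# Hypothetical helpers (not part of rvest); base R only.

# Build the initial search URL. The keyword is percent-encoded so that
# spaces become %20 (an assumption: the original code passes raw spaces).
build_search_url <- function(keyword, staff_group = "SG40", logic = "OR") {
  paste0(
    "https://www.jobs.nhs.uk/xi/search_vacancy/",
    "?action=search",
    "&staff_group=", staff_group,
    "&keyword=", utils::URLencode(keyword, reserved = TRUE),
    "&logic=", logic
  )
}

# Build the URL of a later results page. Note that it contains no search
# terms at all, only the page number.
build_page_url <- function(index_page) {
  paste0("https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=", index_page)
}

search_url <- build_search_url("Nurse Sister Matron")
page2_url  <- build_page_url(2)
```

With these, the session code above could start from `session(search_url)` and then call `session_jump_to(s, page2_url)`.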

CodePudding user response：

Here I scraped pages 2 to 100 without any error. It should work for all 362 available pages. The code is inspired by the answer from @Dave2e.

library(tidyverse)
library(rvest)
library(httr2)

ses <-
  "https://www.jobs.nhs.uk/xi/search_vacancy/?action=search&staff_group=SG40&keyword=Nurse Sister Matron&logic=OR" %>%
  session()

# total number of pages, taken from the pagination links of the first results page
n_pages <- read_html(ses) %>%
  html_element("li:nth-child(10) a") %>%
  html_text2() %>%
  as.numeric()

get_info <- function(index_page) {
  cat("Scraping page", index_page, "...", "\n")
  page <- session_jump_to(ses,
                          paste0("https://www.jobs.nhs.uk/xi/search_vacancy?action=page&page=", 
                                 index_page)) %>%
    read_html()
  
  tibble(
    from_page = index_page, 
    position = page %>%
      html_elements("h2 a") %>%
      html_text2(),
    practice = page %>%
      html_elements(".vacancy h3") %>%
      html_text2(),
    salary = page %>%
      html_elements(".salary") %>%
      html_text2(),
    type = page %>%
      html_elements(".left dl~ dl  dl dd") %>%
      html_text2()
  )
}

df <- map_dfr(2:100, get_info)

# A tibble: 1,980 × 5
   from_page position                             practice  salary type 
       <int> <chr>                                <chr>     <chr>  <chr>
 1         2 Practice Nurse or Nurse Practitioner General … Depen… Perm…
 2         2 Practice Nurse                       General … Depen… Perm…
 3         2 Practice Nurse                       General … Depen… Perm…
 4         2 Practice Nurse                       General … Depen… Perm…
 5         2 Practice Nurse                       General … Depen… Perm…
 6         2 Practice Nurse                       General … Depen… Perm…
 7         2 Practice Nurse                       General … Depen… Perm…
 8         2 Practice Nurse                       General … Depen… Perm…
 9         2 Practice Nurse                       General … Depen… Perm…
10         2 Staff Nurse                          Neurology £2565… Perm…
# … with 1,970 more rows
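
For a run over all pages, a single failed request would abort `map_dfr()`. Below is a minimal sketch of one way to guard against that, using base R only. `safe_scrape()` is a hypothetical wrapper (not part of rvest) that pauses before each request and returns NULL when the scraping function throws an error. The one-second default delay is an arbitrary assumption, and `fake_scraper()` merely stands in for `get_info()` so the example runs offline.

```r
# Hypothetical wrapper: wait, call the scraper, and turn errors into NULL.
safe_scrape <- function(scrape_fun, index_page, delay = 1) {
  Sys.sleep(delay)
  tryCatch(
    scrape_fun(index_page),
    error = function(e) {
      message("Page ", index_page, " failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for get_info(): fails on page 3, succeeds otherwise.
fake_scraper <- function(index_page) {
  if (index_page == 3) stop("Failed to parse text")
  data.frame(from_page = index_page)
}

# delay = 0 only because nothing is actually requested here.
results <- lapply(2:4, function(i) safe_scrape(fake_scraper, i, delay = 0))

# Drop the failed pages and stack the rest into one data frame.
df_demo <- do.call(rbind, Filter(Negate(is.null), results))
```

With the real scraper, the call would be `lapply(2:100, function(i) safe_scrape(get_info, i))`, and the pages that returned NULL can be retried afterwards.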