Scraping nested links

Time:06-17

I would like to scrape http://csla.history.ox.ac.uk/search.php after applying a filter as follows

  1. clicking on 'Saint'
  2. selecting 'Gaul and Frankish kingdoms' under 'Region of Birth/Burial'
  3. clicking on 'Apply Search'

I am stuck because the URL does not change when the filter is applied, so I cannot simply request the filtered results by their address.

The page source around <option value="Gaul">Gaul and Frankish kingdoms</option> looks as follows:

<div  id="fl-page4-12">
<label for="item_12">Region of Birth/Burial</label>
<label >
<select id="text-nine" name="form[item_89]">
<option value=""></option>
<option value="East">'The East' (unspecified)</option>
<option value="West">'The West' (unspecified)</option>
<option value="Britain">Britain and Ireland</option>
<option value="Gaul">Gaul and Frankish kingdoms</option>

On the results page, I would like to follow each of the IDs shown in blue (e.g. the first one is E06478).

On the record page that opens (e.g. http://csla.history.ox.ac.uk/record.php?recid=E06478), I would like to follow the ID listed in the table 'Related Saint Records' (e.g. S01319 in this case).

On the saint page that opens (e.g. http://csla.history.ox.ac.uk/record.php?recid=S01319), I would like to scrape the Saint ID (e.g. 'S01319'), Name (e.g. 'Orientius, bishop of Auch, 5th c.'), Reported Death Not Before, Reported Death Not After, Gender and Type of Saint, and collect them in a data frame.

CodePudding user response:

Since you asked a similar question before, I will build on the solution given there.

(The code at the beginning is copied from that solution. The extension at the end creates new columns for the additional data and fills them by scraping each saint page with rvest.)
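
The filter is sent to results.php in the body of a POST request rather than in the URL, which is presumably why the address never changes. To make the hand-built request body in the code below easier to read, here is a minimal base-R sketch of how such a multipart/form-data body is put together. The helper name build_multipart_body and the boundary string are made up for illustration; the field names and values are the ones that appear in this thread. The multipart format specifies CRLF line endings, which this sketch uses, whereas the code below uses plain \n.

```r
# Hypothetical helper (not part of httr or rvest): assemble a
# multipart/form-data body from a named list of form fields.
build_multipart_body <- function(fields, boundary) {
  parts <- vapply(names(fields), function(field_name) {
    paste0("--", boundary, "\r\n",
           "Content-Disposition: form-data; name=\"", field_name, "\"\r\n",
           "\r\n",                       # blank line separates headers from value
           fields[[field_name]], "\r\n")
  }, character(1))
  # The closing delimiter is the boundary followed by two extra dashes.
  paste0(paste0(parts, collapse = ""), "--", boundary, "--\r\n")
}

body <- build_multipart_body(
  list("form[item_998]" = "E", "form[item_89]" = "Gaul"),
  boundary = "exampleBoundary123"      # made-up boundary for illustration
)
cat(body)
```

Each field becomes one part that starts with the boundary line, and the empty fields in the real request are simply parts with an empty value.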

library(httr)
library(rvest)

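# Form field numbers and the value sent for each one, in the same order.
# item_89 is the 'Region of Birth/Burial' select, set here to 'Gaul'.
items <- c(998, 1, 18, 89, 90, 2, 88, 20, 3, 4, 5, 6, 12, 13, 11, 999, 213, 214)
contents <- c('\nE\n', '\n\n', '\n\n', '\nGaul\n', rep('\n\n', 11), '\nOr\n',
              '\n\n', '\n\n')
# Build the multipart body by hand: one part per form field ...
s <- paste0("-----------------------------39565121210000504382566389445\n",
            "Content-Disposition: form-data; name=\"form[item_", items,
            ']\"\n', contents,
            collapse = '')
# ... followed by the closing boundary.
s <- paste0(s, '-----------------------------39565121210000504382566389445--')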

type <- paste0('multipart/form-data; boundary=---------------------------',
               '39565121210000504382566389445')

res <- POST('http://csla.history.ox.ac.uk/results.php',
            body = charToRaw(s),
            content_type(type))

df <- res %>%
  read_html() %>%
  html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>%
  html_text() %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c('ID', 'Title')) %>%
  dplyr::as_tibble()


urls <- paste0("http://csla.history.ox.ac.uk/record.php?recid=", df$ID)

all_results <- list()

for(i in seq_along(urls)) {
  all_results[[i]] <- read_html(urls[i]) %>%
    html_elements("td") %>%
    html_text() %>%
    matrix(ncol = 4, byrow = TRUE) %>%
    as.data.frame() %>%
    setNames(c("ID", "Name", "Name_in_source", "Identity"))
}

final_result <- dplyr::bind_rows(all_results)

# continued solution ----------------------

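additional_columns <- c("Name", "Number in BH", "Reported Death Not Before",
                        "Reported Death Not After", "Gender", "Type of Saint")
# Note: this also resets the existing "Name" column, which is then refilled
# from the saint page below.
final_result[, additional_columns] <- NA

for (i in seq_along(final_result$ID)) {
  web_page <- read_html(paste0("http://csla.history.ox.ac.uk/record.php?recid=", final_result$ID[[i]]))
  # simplify = FALSE always returns a named list, one entry per column,
  # even when every field happens to yield the same number of values.
  temp_res <- sapply(additional_columns, function(col) web_page %>%
             html_element(xpath = paste0("//div[contains(text(),'", col, "')]")) %>%
             html_children() %>% html_text(), simplify = FALSE)
  # Keep the first value of each field, or NA when nothing was found.
  final_result[i, additional_columns] <- lapply(temp_res, function(x) if (length(x) == 0) NA else x[[1]])
}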
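
To see what the label lookup in the loop above is doing without hitting the site, here is a small offline sketch. The HTML snippet is made up and only assumes that each value sits in a child element of a <div> whose own text contains the label; the real record pages may be structured differently. The helper name extract_field is hypothetical, and it uses html_elements() with an XPath that selects the children directly, so a missing label gives an empty result instead of a missing node.

```r
library(rvest)

# Made-up markup standing in for a saint page (assumption, not the real page).
page <- minimal_html('
  <div>Name<span>Orientius, bishop of Auch, 5th c.</span></div>
  <div>Gender<span>placeholder value</span></div>
')

# Hypothetical helper: text of the first child element of the <div> whose
# own text contains `label`, or NA if there is no such element.
extract_field <- function(page, label) {
  values <- page %>%
    html_elements(xpath = paste0("//div[contains(text(), '", label, "')]/*")) %>%
    html_text(trim = TRUE)
  if (length(values) == 0) NA_character_ else values[[1]]
}

fields <- c("Name", "Gender", "Type of Saint")
record <- vapply(fields, function(f) extract_field(page, f), character(1))

# One-row data frame with one column per field.
as.data.frame(as.list(record), check.names = FALSE)
```

Because contains() is a substring match, a short label such as "Name" can also match longer labels that contain it, so it is worth checking the first match on a real page.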