I would like to scrape http://csla.history.ox.ac.uk/search.php after applying a filter as follows:
- clicking on 'Saint'
- selecting 'Gaul and Frankish kingdoms' under 'Region of Birth/Burial'
- clicking on 'Apply Search'
I'm struggling because the URL does not get updated to reflect the filter.
The source code containing <option value="Gaul">Gaul and Frankish kingdoms</option> looks as follows:
<div id="fl-page4-12">
<label for="item_12">Region of Birth/Burial</label>
<label >
<select id="text-nine" name="form[item_89]">
<option value=""></option>
<option value="East">'The East' (unspecified)</option>
<option value="West">'The West' (unspecified)</option>
<option value="Britain">Britain and Ireland</option>
<option value="Gaul">Gaul and Frankish kingdoms</option>
From the resulting page, I would then like to access the IDs that are written in blue, i.e. the first one would be E06478.
Any help would be very much appreciated!
CodePudding user response:
This is a tricky one. You need to POST the query to the server, and the query needs to be in a very particular format. You can get the HTML from the page like this:
library(httr)
library(rvest)

# Search form fields: item_89 is the 'Region of Birth/Burial' select,
# set to "Gaul"; most of the other fields are sent empty
items <- c(998, 1, 18, 89, 90, 2, 88, 20, 3, 4, 5, 6, 12, 13, 11, 999, 213, 214)
contents <- c('\nE\n', '\n\n', '\n\n', '\nGaul\n', rep('\n\n', 11), '\nOr\n',
              '\n\n', '\n\n')

# Assemble the multipart/form-data body by hand, one part per form field
s <- paste0("-----------------------------39565121210000504382566389445\n",
            "Content-Disposition: form-data; name=\"form[item_", items,
            ']\"\n', contents,
            collapse = '')
s <- paste0(s, '-----------------------------39565121210000504382566389445--')

# The boundary declared in the Content-Type header has to match the body
type <- paste0('multipart/form-data; boundary=---------------------------',
               '39565121210000504382566389445')

res <- POST('http://csla.history.ox.ac.uk/results.php',
            body = charToRaw(s),
            content_type(type))
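Before parsing the response, it may be worth confirming the request actually succeeded; httr's stop_for_status() raises an error on any non-2xx status:

# Fail fast if the server returned an error status
stop_for_status(res)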
To get all the results in a neat data frame, you can then do:
df <- res %>%
  read_html() %>%
  # Cells styled LightGray appear to be headers/padding, so exclude them
  html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>%
  html_text() %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c('ID', 'Title')) %>%
  dplyr::as_tibble()
This gets you all the reference IDs in a data frame. To get the actual pages, we use these as query strings:
urls <- paste0("http://csla.history.ox.ac.uk/record.php?recid=", df$ID)
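Before fetching them all, a quick sanity check that the scrape worked is cheap insurance; the "^E\d+$" pattern is my assumption about the ID format, based on the E06478 example above:

# Every ID should look like 'E' followed by digits (e.g. E06478)
stopifnot(nrow(df) > 0, all(grepl("^E\\d+$", df$ID)))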
Now we need to go through all 900 pages to extract the tabular data. It's safest to do this in a loop, then bind the list together at the end:
all_results <- list()

for (i in seq_along(urls)) {
  # Each record page holds a four-column table of associated saints
  all_results[[i]] <- read_html(urls[i]) %>%
    html_elements("td") %>%
    html_text() %>%
    matrix(ncol = 4, byrow = TRUE) %>%
    as.data.frame() %>%
    setNames(c("ID", "Name", "Name_in_source", "Identity"))
}
final_result <- dplyr::bind_rows(all_results)
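If you would rather not hit the server with 900 back-to-back requests, here is a hedged variant of the same loop that pauses between requests and skips pages that fail to download (the one-second delay is my own arbitrary choice, not a documented requirement of the site):

all_results <- list()
for (i in seq_along(urls)) {
  # Skip this record if the page cannot be downloaded
  page <- tryCatch(read_html(urls[i]), error = function(e) NULL)
  if (is.null(page)) next
  all_results[[i]] <- page %>%
    html_elements("td") %>%
    html_text() %>%
    matrix(ncol = 4, byrow = TRUE) %>%
    as.data.frame() %>%
    setNames(c("ID", "Name", "Name_in_source", "Identity"))
  Sys.sleep(1)  # be polite: pause between requests
}
# bind_rows() ignores the NULL gaps left by any skipped pages
final_result <- dplyr::bind_rows(all_results)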
The final result is now a data frame with over 3000 rows. Here are the first 3:
head(final_result, 3)
#>       ID                                       Name Name_in_source Identity
#> 1 S01319          Orientius, bishop of Auch, 5th c.                 Certain
#> 2 S02351 Mamertus, bishop of Vienne (Gaul), ob. 475                 Certain
#> 3 S00316                            Martyrs of Lyon                 Certain
Some of the IDs are duplicates, since they appear on multiple pages. You could use unique to remove these.
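For instance, keeping only the first row for each ID (using duplicated() rather than unique() so that whole rows are retained):

# Drop rows whose ID has already been seen
final_result <- final_result[!duplicated(final_result$ID), ]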
Note also that when you print the data frame to the console, Greek letters will appear as Unicode escape sequences. The text is still there in the underlying vector, though. For example:
head(final_result[3])
#> Name_in_source
#> 1
#> 2
#> 3
#> 4
#> 5 <U+03A0><U+03BF><U+03BB><U+03CD><U+03BA><U+03B1><U+03C1><U+03C0><U+03BF><U+03C2>
#> 6 <U+03A0><U+03B9><U+03CC><U+03BD><U+03B9><U+03BF><U+03C2>
But
final_result[1:6, 3]
#> [1] "" "" "" "" "Πολύκαρπος" "Πιόνιος"