Scraping webpage when filter does not change URL

Time:06-14

I would like to scrape http://csla.history.ox.ac.uk/search.php after applying a filter as follows:

  1. clicking on 'Saint'
  2. selecting 'Gaul and Frankish kingdoms' under 'Region of Birth/Burial'
  3. clicking on 'Apply Search'

I am struggling because the URL does not change when the filter is applied.

The source code containing <option value="Gaul">Gaul and Frankish kingdoms</option> looks as follows:

<div  id="fl-page4-12">
<label for="item_12">Region of Birth/Burial</label>
<label >
<select id="text-nine" name="form[item_89]">
<option value=""></option>
<option value="East">'The East' (unspecified)</option>
<option value="West">'The West' (unspecified)</option>
<option value="Britain">Britain and Ireland</option>
<option value="Gaul">Gaul and Frankish kingdoms</option>

From the resulting page, I would then like to extract the IDs that are written in blue; for example, the first one is E06478.

Any help would be very much appreciated!

CodePudding user response:

This is a tricky one. Because the filter is submitted as a form, the URL never changes: you need to POST the query to the server, and the query needs to be in a very particular format. You can get the HTML of the results page like this:

library(httr)
library(rvest)

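# Form field numbers sent with the query; item_89 is the
# 'Region of Birth/Burial' select shown in the question
items <- c(998, 1, 18, 89, 90, 2, 88, 20, 3, 4, 5, 6, 12, 13, 11, 999, 213, 214)
# Value sent for each field, in the same order as items (most are left empty)
contents <- c('\nE\n', '\n\n', '\n\n', '\nGaul\n', rep('\n\n', 11), '\nOr\n',
              '\n\n', '\n\n')
# Build one multipart section per field, each introduced by the boundary line
s <- paste0("-----------------------------39565121210000504382566389445\n",
       "Content-Disposition: form-data; name=\"form[item_", items,
       ']\"\n', contents,
       collapse = '')
# Terminate the body with the closing boundary
s <- paste0(s, '-----------------------------39565121210000504382566389445--')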

type <- paste0('multipart/form-data; boundary=---------------------------',
               '39565121210000504382566389445')

res <- POST('http://csla.history.ox.ac.uk/results.php',
           body = charToRaw(s),
           content_type(type))
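If you want to change the filter later, it may be easier to build the body from named fields instead of editing the string by hand. The following is only a sketch: build_multipart_body is a hypothetical helper, not part of httr, and it copies the layout used above (boundary line, header line, blank line, value), including the bare "\n" line endings. Only three of the 18 fields are shown here; I am assuming the server still wants all of them, as in the code above.

```r
# Hypothetical helper (not part of httr): build a multipart/form-data body
# from a named character vector, using the same layout as the string above.
build_multipart_body <- function(fields, boundary) {
  parts <- paste0(
    "--", boundary, "\n",
    "Content-Disposition: form-data; name=\"", names(fields), "\"\n",
    "\n", fields, "\n",
    collapse = ""
  )
  # The closing delimiter is the boundary with two extra trailing dashes
  paste0(parts, "--", boundary, "--")
}

# Same boundary as in the Content-Type header above: 27 dashes plus digits
boundary <- paste0(strrep("-", 27), "39565121210000504382566389445")

# Illustrative subset of the fields; the names match the form's name attributes
fields <- c("form[item_998]" = "E",
            "form[item_89]"  = "Gaul",
            "form[item_999]" = "Or")

body <- build_multipart_body(fields, boundary)
cat(body)
```

The result can be passed to POST in the same way as s above, i.e. body = charToRaw(body) together with the matching content_type.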

To get all the results in a neat data frame, you can then do:

df <- res %>% 
  read_html() %>% 
  html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>% 
  html_text() %>% 
  matrix(ncol = 2, byrow = TRUE) %>% 
  as.data.frame() %>% 
  setNames(c('ID', 'Title')) %>% 
  dplyr::as_tibble()
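To see what the XPath filter and the matrix() reshape are doing, here is the same idea on a tiny made-up table. The HTML, the IDs and the titles below are invented for illustration; the real results page may be marked up differently.

```r
library(rvest)

# Made-up HTML standing in for the results page (an assumption, not the real markup)
page <- read_html('<html><body><table>
  <tr><td style="background-color: LightGray">ID</td>
      <td style="background-color: LightGray">Title</td></tr>
  <tr><td>E00001</td><td>First example record</td></tr>
  <tr><td>E00002</td><td>Second example record</td></tr>
</table></body></html>')

# Keep every <td> whose style attribute does not mention LightGray,
# which drops the shaded header cells
cells <- page %>%
  html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>%
  html_text()

# The cells come back as one flat vector (ID, Title, ID, Title, ...),
# so check the length before folding it into two columns
stopifnot(length(cells) %% 2 == 0)

demo_df <- cells %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c("ID", "Title"))

print(demo_df)
```

The length check is worth keeping in real use: if the page layout changes, matrix() would otherwise recycle or misalign values with only a warning.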

This gets you all the reference IDs in a data frame. To get the actual pages, we use these as query strings:

urls <- paste0("http://csla.history.ox.ac.uk/record.php?recid=", df$ID)

Now we need to go through all 900 pages to extract the tabular data. It's safest to do this in a loop then bind the list together at the end:

all_results <- list()

for(i in seq_along(urls)) {
  all_results[[i]] <- read_html(urls[i]) %>% 
                       html_elements("td") %>% 
                       html_text() %>%
                       matrix(ncol = 4, byrow = TRUE) %>%
                       as.data.frame() %>%
                       setNames(c("ID", "Name", "Name_in_source", "Identity"))
}

final_result <- dplyr::bind_rows(all_results)
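With this many requests, a single failed page would abort the loop above and lose everything fetched so far. One way to guard against that is sketched below. fetch_all is a hypothetical helper, and the one-second default pause is my own assumption about what is polite, not something the site documents. A fake fetcher stands in for the real read_html pipeline so the example runs offline.

```r
# Hypothetical helper: call fetch_one on each URL, pause between requests,
# and record a failure as NULL instead of stopping the whole run.
fetch_all <- function(urls, fetch_one, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    out <- tryCatch(
      fetch_one(urls[i]),
      error = function(e) {
        message("Failed: ", urls[i], " (", conditionMessage(e), ")")
        NULL
      }
    )
    # Assigning NULL with [[<- would delete the element, so only store successes
    if (!is.null(out)) results[[i]] <- out
    Sys.sleep(delay)
  }
  results
}

# Fake fetcher for demonstration: fails on any URL containing "bad"
fake_fetch <- function(u) {
  if (grepl("bad", u, fixed = TRUE)) stop("simulated failure")
  data.frame(ID = u, stringsAsFactors = FALSE)
}

demo_results <- fetch_all(c("page-a", "bad-page", "page-c"), fake_fetch, delay = 0)

# Drop the failed entries before combining
ok <- Filter(Negate(is.null), demo_results)
demo_combined <- do.call(rbind, ok)
print(demo_combined)
```

In real use, fetch_one would be a function wrapping the read_html pipeline from the loop above, and dplyr::bind_rows can be used on the result as before.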

The final result is now a data frame with over 3000 rows. Here are the first 3:

head(final_result, 3)
#>       ID                                       Name Name_in_source Identity
#> 1 S01319          Orientius, bishop of Auch, 5th c.                 Certain
#> 2 S02351 Mamertus, bishop of Vienne (Gaul), ob. 475                 Certain
#> 3 S00316                            Martyrs of Lyon                 Certain

Some of the IDs are duplicates since they appear in multiple pages. You could use unique() to remove these. Note also that when you print a data frame to the console, Greek letters may appear as Unicode escape sequences, depending on your platform and locale. The text is still there in the underlying vector though. For example:

head(final_result[3])
#>                                                       Name_in_source
#> 1                                                                   
#> 2                                                                   
#> 3                                                                   
#> 4                                                                   
#> 5 <U+03A0><U+03BF><U+03BB><U+03CD><U+03BA>a<U+03C1>p<U+03BF><U+03C2>
#> 6           <U+03A0><U+03B9><U+03CC><U+03BD><U+03B9><U+03BF><U+03C2>

But indexing the column directly shows the actual characters:

final_result[1:6, 3]
#> [1] ""          ""          ""          ""          "Πολύκαρπος" "Πιόνιος"  
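
Coming back to the duplicate IDs mentioned earlier, here is a small sketch of two ways to drop them. The rows are made up for illustration. Which variant you want depends on whether rows sharing an ID are always identical, which I have not checked for this site.

```r
# Made-up rows standing in for final_result
records <- data.frame(
  ID   = c("S00001", "S00002", "S00001"),
  Name = c("Example saint A", "Example saint B", "Example saint A"),
  stringsAsFactors = FALSE
)

# Drop rows that are identical in every column
deduped <- unique(records)

# Or keep only the first row seen for each ID, even if other columns differ
by_id <- records[!duplicated(records$ID), ]

print(deduped)
print(by_id)
```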