webscraping from multiple urls in r

Time:08-07

I am trying to scrape tables from multiple URLs. I am using the following code to scrape a table from a single URL:

library(tidyverse)
library(rvest)

url <- "https://uboat.net/allies/commanders/1.html"

read_html(url) %>%
  html_element('table.table_subtle') %>%
  html_table()

However, I want to do this for 50 URLs, numbered sequentially from 1 to 50. Is there a quick way to do this?

CodePudding user response:

There's almost always a quick way. You just have to look for a pattern in the URLs. In your case, the URLs are numbered starting at 1 and working up to 50, so you can use a loop to construct them, like so (here are the first 5):


for (i in 1:5) {
  # Build the URL for page i
  url <- paste0("https://uboat.net/allies/commanders/", i, ".html")

  read_html(url) %>%
    html_element('table.table_subtle') %>%
    html_table() %>%
    print()
}

# # A tibble: 2 × 4
#   X1    X2         X3     X4                                 
#   <chr> <chr>      <chr>  <chr>                              
# 1 Born  3 Jun 1896 ""     Plymouth, Devon, England           
# 2 Died  9 Jul 1944 "(48)" Naval Hospital, Seaforth, Liverpool
# # A tibble: 2 × 4
#   X1    X2          X3     X4                           
#   <chr> <chr>       <chr>  <chr>                        
# 1 Born  26 Jan 1904 ""     Dehra Dun, Uttarakhand, India
# 2 Died  23 May 1981 "(77)" Ashford, Kent, England       
# # A tibble: 2 × 4
#   X1    X2          X3     X4                           
#   <chr> <chr>       <chr>  <chr>                        
# 1 Born  10 Jul 1901 ""     Chicago, USA                 
# 2 Died  16 Jan 1977 "(75)" Bethesda Naval Medical Center
# # A tibble: 2 × 4
#   X1    X2         X3     X4                               
#   <chr> <chr>      <chr>  <chr>                            
# 1 Born  2 Feb 1901 ""     Mayfair, London, England         
# 2 Died  4 Mar 1997 "(96)" Bamburgh, Northumberland, England
# # A tibble: 2 × 4
#   X1    X2          X3     X4               
#   <chr> <chr>       <chr>  <chr>            
# 1 Born  22 Sep 1899 ""     Quimby, Iowa, USA
# 2 Died  19 Mar 1945 "(45)" USS Franklin  
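
If you want to keep the tables rather than print them, one option is to build all the URLs up front and wrap the scraping step in a function. The following is only a sketch: `base_url`, `urls` and `scrape_commander` are names made up for illustration, and it assumes every page has a `table.table_subtle` element like the first five do.

```r
library(rvest)

# paste0() is vectorised, so this builds all 50 URLs in one call.
base_url <- "https://uboat.net/allies/commanders/"
urls <- paste0(base_url, 1:50, ".html")

# Hypothetical helper: read one page and return its table as a tibble.
scrape_commander <- function(url) {
  read_html(url) %>%
    html_element("table.table_subtle") %>%
    html_table()
}
```

Something like `tables <- lapply(urls[1:5], scrape_commander)` should then return a list with one tibble per URL, which is easier to work with afterwards than printed output.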

To handle cases where a URL isn't available, you can wrap the request in tryCatch() so that missing pages are skipped (e.g. the 11th page in this case):

for (i in 10:12) {
  url <- paste0("https://uboat.net/allies/commanders/", i, ".html")

  tryCatch({
    read_html(url) %>%
      html_element('table.table_subtle') %>%
      html_table() %>%
      print()
  }, error = function(e) {
    # Do nothing on error: the page is skipped and the loop moves on
  })
}
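
The same skip-on-error idea can be combined with collecting everything into one data frame. This is a rough sketch rather than a tested solution: it swaps tryCatch() for `purrr::possibly()`, the names `scrape_table`, `safe_scrape` and `scrape_pages` are invented here, the one-second pause is an arbitrary choice to go easy on the server, and it assumes the tables on different pages have compatible column types so that `bind_rows()` can stack them.

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical helper: read one page and return its table as a tibble.
scrape_table <- function(url) {
  read_html(url) %>%
    html_element("table.table_subtle") %>%
    html_table()
}

# possibly() returns NULL instead of raising an error when a page fails.
safe_scrape <- possibly(scrape_table, otherwise = NULL)

scrape_pages <- function(ids, pause = 1) {
  urls <- paste0("https://uboat.net/allies/commanders/", ids, ".html")
  names(urls) <- as.character(ids)  # names become the "page" column below

  tables <- map(urls, function(u) {
    Sys.sleep(pause)  # wait between requests
    safe_scrape(u)
  })

  # compact() drops the NULLs left by failed pages before stacking the rest.
  bind_rows(compact(tables), .id = "page")
}
```

Calling `scrape_pages(1:50)` would then attempt all 50 pages and return a single tibble with a `page` column recording which page each row came from; pages that fail to load are simply absent from the result.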