I am trying to scrape tables from multiple URLs. I am using the following code to scrape a table from a single URL:
library(tidyverse)
library(rvest)
url <- "https://uboat.net/allies/commanders/1.html"
read_html(url) %>%
  html_element("table.table_subtle") %>%
  html_table()
However, I want to do this for 50 URLs, numbered sequentially from 1 to 50. Is there a quick way to do it?
CodePudding user response:
There's almost always a quick way: look for a pattern in the URLs. In your case, the pages are numbered sequentially from 1 to 50, so you can construct each URL inside a loop, like so (here are the first 5):
for (i in 1:5) {
  # build the url for page i
  url <- paste0("https://uboat.net/allies/commanders/", i, ".html")
  read_html(url) %>%
    html_element("table.table_subtle") %>%
    html_table() %>%
    print()
}
# # A tibble: 2 × 4
# X1 X2 X3 X4
# <chr> <chr> <chr> <chr>
# 1 Born 3 Jun 1896 "" Plymouth, Devon, England
# 2 Died 9 Jul 1944 "(48)" Naval Hospital, Seaforth, Liverpool
# # A tibble: 2 × 4
# X1 X2 X3 X4
# <chr> <chr> <chr> <chr>
# 1 Born 26 Jan 1904 "" Dehra Dun, Uttarakhand, India
# 2 Died 23 May 1981 "(77)" Ashford, Kent, England
# # A tibble: 2 × 4
# X1 X2 X3 X4
# <chr> <chr> <chr> <chr>
# 1 Born 10 Jul 1901 "" Chicago, USA
# 2 Died 16 Jan 1977 "(75)" Bethesda Naval Medical Center
# # A tibble: 2 × 4
# X1 X2 X3 X4
# <chr> <chr> <chr> <chr>
# 1 Born 2 Feb 1901 "" Mayfair, London, England
# 2 Died 4 Mar 1997 "(96)" Bamburgh, Northumberland, England
# # A tibble: 2 × 4
# X1 X2 X3 X4
# <chr> <chr> <chr> <chr>
# 1 Born 22 Sep 1899 "" Quimby, Iowa, USA
# 2 Died 19 Mar 1945 "(45)" USS Franklin
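Printing is fine for a quick check, but you'll usually want to keep the tables. Since you already load the tidyverse, purrr::map() can collect them into a list, and bind_rows() can stack them into one data frame. A minimal sketch, assuming all 50 pages exist and each carries the same four-column table:

library(tidyverse)
library(rvest)

urls <- paste0("https://uboat.net/allies/commanders/", 1:50, ".html")

# scrape each page and keep the resulting tibbles in a list
tables <- map(urls, ~ read_html(.x) %>%
                html_element("table.table_subtle") %>%
                html_table())

# stack into one data frame, with a `page` column recording the source page
all_tables <- bind_rows(tables, .id = "page")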
To handle cases where a URL isn't available, you can wrap the scraping step in tryCatch()
to skip missing pages (e.g. page 11 in this case):
for (i in 10:12) {
  url <- paste0("https://uboat.net/allies/commanders/", i, ".html")
  tryCatch({
    read_html(url) %>%
      html_element("table.table_subtle") %>%
      html_table() %>%
      print()
  }, error = function(e) {})  # silently skip pages that fail to load
}
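The tidyverse equivalent of that tryCatch() is purrr::possibly(), which wraps a function so errors return a default value instead of stopping the loop. A sketch along the same lines, again assuming the URL pattern above (missing pages come back as NULL and are dropped with compact()):

library(tidyverse)
library(rvest)

# returns the page's table, or NULL if the page can't be read
scrape_page <- possibly(function(url) {
  read_html(url) %>%
    html_element("table.table_subtle") %>%
    html_table()
}, otherwise = NULL)

urls <- paste0("https://uboat.net/allies/commanders/", 1:50, ".html")

tables <- map(urls, scrape_page) %>%
  compact()  # drop the NULLs left by missing pages

Either way, if you scrape all 50 pages in one go, a short Sys.sleep() between requests is kinder to the server.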