Scraping a web table through multiple pages (some rows are missing)

I'd like to scrape a table (containing information about 31,385 soldiers) from https://irelandsgreatwardead.ie/the-archive/ using rvest.

library(rvest)
library(dplyr)

page <- read_html(x = "https://irelandsgreatwardead.ie/the-archive/")    
table <- page             %>% 
  html_nodes("table")     %>%  
  html_table(fill = TRUE) %>%
  as.data.frame()

This works, but only for the first 10 soldiers; the page source itself only contains the information for those first 10. Any help on how to obtain the rows for the other soldiers would be highly appreciated!

Thanks and have a great day!

CodePudding user response:

Here is an RSelenium solution.

You can loop through the pages, extracting the table on each one and appending it to the previous result.

First, launch the browser:

library(RSelenium)
library(rvest)
library(dplyr)

url <- "https://irelandsgreatwardead.ie/the-archive/"
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
remDr$navigate(url)

PART 1: Extract the table from the first page and store it in df:

df = remDr$getPageSource()[[1]] %>% 
  read_html() %>%
  html_table() 
df = df[[1]]
# removing the last row, which is non-essential
df = df[-nrow(df),]

PART 2: Loop through pages 2 to 5

for (i in 2:5) {
  # Build the xpath for this page's pagination link
  xp = paste0('//*[@id="table_1_paginate"]/span/a[', i, ']')
  cc <- remDr$findElement(using = 'xpath', value = xp)
  cc$clickElement()

  # A three-second pause gives the page time to load
  Sys.sleep(3)
  df1 = remDr$getPageSource()[[1]] %>% 
    read_html() %>%
    html_table() 
  df1 = df1[[1]]
  df1 = df1[-nrow(df1),]

  # Append the current page's table `df1` to `df`
  df = rbind(df, df1)
}

PART 3: Loop through the remaining pages, 6 to 628

From page 6 onward, the xpath of the pagination link stays the same, so we repeat this block 623 times to collect the tables from the remaining pages.

for (i in 1:623) {
  cc <- remDr$findElement(using = 'xpath', value = '//*[@id="table_1_paginate"]/span/a[4]')
  cc$clickElement()
  Sys.sleep(3)
  df1 = remDr$getPageSource()[[1]] %>% 
    read_html() %>%
    html_table() 
  df1 = df1[[1]]
  df1 = df1[-nrow(df1),]
  df = rbind(df, df1)
}

Now df holds the information for all the soldiers.
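For completeness, the three parts above can be folded into a single loop. This is only a sketch: the helper name scrape_current_page is my own, and the index rule simply mirrors the two loops above (a[i] for pages 2 to 5, then a[4] thereafter).

# Sketch: one helper extracts whatever table the browser currently shows
scrape_current_page <- function(remDr) {
  tbl <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_table()
  tbl <- tbl[[1]]
  tbl[-nrow(tbl), ]   # drop the non-essential last row
}

df <- scrape_current_page(remDr)   # page 1
for (i in 2:628) {
  idx <- if (i <= 5) i else 4      # mirrors the two loops above
  xp  <- paste0('//*[@id="table_1_paginate"]/span/a[', idx, ']')
  remDr$findElement(using = 'xpath', value = xp)$clickElement()
  Sys.sleep(3)                     # give the page time to load
  df  <- rbind(df, scrape_current_page(remDr))
}

This keeps the extraction logic in one place, so a change (for example, a longer wait) only has to be made once.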

CodePudding user response:

library(RSelenium)
library(rvest)
library(dplyr)

driver <- rsDriver(browser = "firefox")

remDr <- driver[["client"]]
url <- 'https://irelandsgreatwardead.ie/the-archive/'
remDr$navigate(url)

# Locate the next-page link
webElem <- remDr$findElement(using = "css", value = "a[data-dt-idx='3']")

# Click that link
webElem$clickElement()

# Get that table
remDr$getPageSource()[[1]] %>% 
  read_html() %>%
  html_table()

Your for loop needs to start at a value of 3 (that's the second page!). On the third page the index becomes 4, and so on, but it never goes above 5 because of how the pagination widget is designed: loop over 3:5, and then keep clicking index 5 for every page after that.
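That index rule can be sketched as a small loop (assuming 628 pages in total, as in the other answer; the table-extraction step is elided):

for (p in 2:628) {
  idx <- min(p + 1, 5)   # page 2 -> 3, page 3 -> 4, then it stays at 5
  sel <- paste0("a[data-dt-idx='", idx, "']")
  remDr$findElement(using = "css", value = sel)$clickElement()
  Sys.sleep(3)
  # ...extract the table with read_html()/html_table() and rbind as above
}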
