Extracting a table that spans multiple pages


I am attempting to extract a table that spans multiple pages on an old website.

https://botrank.pastimes.eu/

The site lists bots in order of score, along with their good and bad votes and their comment and link karma. I would like to extract the table in rank order across all 318 pages; the link https://botrank.pastimes.eu/?sort=rank&page=1, for example, is the first page.

The code I tried was:

pages <- seq(1:318)

bots <- lapply(pages, function(i){
  url <- paste0("https://botrank.pastimes.eu/?sort=rank&page=", i)
  webpage <- url %>%
    httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
    read_html()
  data <- webpage %>%
    html_node("table") %>%
    html_table() %>%
    as_tibble()
  colnames(data) = data[1,]
})

bots_table <- do.call(rbind, bots)
head(bots_table, n=10)

This gives me a clean tibble, but with only the first row of each page. Here is the output:

# A tibble: 318 × 7
   Rank  `Bot Name`          Score Good Bo…¹ Bad B…² Comme…³ Link …⁴
   <chr> <chr>               <dbl> <chr>     <chr>   <chr>   <chr>  
 1 1     KickOpenTheDoorBot  0.993 20,877    119     38,594  98,297 
 2 251   NinNinBot           0.921 45        0       47      1      
 3 501   RegularEality       0.859 99        8       0       0      
 4 751   BillyCloneasaurus   0.806 16        0       267,779 9,350  
 5 1,001 MamataBot           0.758 12        0       357     5      
 6 1,251 slashy_potato_mashy 0.703 33        6       0       0      
 7 1,501 jimmy-b-bot         0.667 45        12      14,531  151    
 8 1,751 related_threads     0.616 23        6       1,727   1      
 9 2,001 RemoveMeNot         0.567 15        4       13,595  2      
10 2,251 python_boti         0.552 10        2       0       0      
# … with 308 more rows, and abbreviated variable names ¹`Good Bot Votes`, ²`Bad Bot Votes`, ³`Comment Karma`, ⁴`Link Karma`

The website source code seems standard, so I'm not sure why this is happening. I am also fairly new to web scraping. Any suggestions would be great! Here is a snippet of the page source:

<table >
  <tr>
    <th>
        <div style="margin: 1px" ></div>
      
      <a href="/?sort=rank&order=reverse">Rank</a></th>
    <th>
      <a href="/?sort=name">Bot Name</a></th>
    <th>
      <a href="/?sort=score">Score</a></th>
    <th><a href="/?sort=good-votes">Good Bot Votes</a></th>
    <th>
      <a href="/?sort=bad-votes">Bad Bot Votes</a></th>
    <th>
      <a href="/?sort=comment-karma">Comment Karma</a></th>
    <th>
      <a href="/?sort=link-karma">Link Karma</a></th>
  </tr>
  
  <tr>
    <td>1</td>
    <td><a href= https://www.reddit.com/user/KickOpenTheDoorBot>KickOpenTheDoorBot</a></td>
    <td>0.9932</td>
    <td>20,877</td>
    <td>119</td>
    <td>38,594</td>
    <td>98,297</td>
  </tr>
  
  <tr>
    <td>2</td>
    <td><a href= https://www.reddit.com/user/Canna_Tips>Canna_Tips</a></td>
    <td>0.992</td>
    <td>18,045</td>
    <td>121</td>
    <td>49,670</td>
    <td>1</td>
  </tr>

CodePudding user response:

The following works. Two things changed. First, html_elements is used instead of html_node. Second, and more importantly, the function now ends with data, so it returns the whole tibble. In your version the last expression was colnames(data) = data[1,]; in R an assignment evaluates, invisibly, to the value being assigned, so each lapply iteration returned only the first row of its page. That is why you got exactly one row per page, 318 in total.
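
A minimal illustration of that return-value behaviour, using a toy data frame rather than the scraped data:

f <- function() {
  x <- data.frame(a = 1:3, b = 4:6)
  colnames(x) = x[1, ]  # an assignment as the last expression
}
res <- f()
res  # f() returned x[1, ], not the renamed x
#>   a b
#> 1 1 4

With that in mind, here is the corrected scraper: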

suppressPackageStartupMessages({
  library(rvest)
  library(httr)
  library(tidyverse)
})

pages <- 1:318

bots <- lapply(pages, function(i){
  url <- paste0("https://botrank.pastimes.eu/?sort=rank&page=", i)
  webpage <- url %>%
    httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
    read_html()
  data <- webpage %>%
    html_elements("table") %>%      # every matching table (here just one)
    html_table() %>%                # a list of tibbles, one per table
    unlist(recursive = FALSE) %>%   # flatten the one-element list into a named list of columns
    as_tibble()
  data                              # return the full tibble, not an assignment
})

length(bots)       # 318 list elements, one per page
sapply(bots, dim)  # rows and columns of each page's table

Then rbind them together:

bots_table <- do.call(rbind, bots)
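
As an aside, the whole loop can be collapsed into one tibble in a single step with purrr::map_dfr. This is only a sketch under the same assumptions as above; convert = FALSE keeps every column as character, so pages whose Rank values parse differently (with and without thousands separators) still bind cleanly, and you can convert the types afterwards:

library(rvest)
library(httr)
library(purrr)

bots_table <- map_dfr(1:318, function(i) {
  paste0("https://botrank.pastimes.eu/?sort=rank&page=", i) %>%
    GET(config = config(ssl_verifypeer = FALSE)) %>%
    read_html() %>%
    html_element("table") %>%    # the first (and only) table on the page
    html_table(convert = FALSE)  # keep columns as character for a clean bind
})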