I am attempting to extract a table that spans multiple pages on an old website.
The site ranks a series of bots by score, good and bad bot votes, and comment and link karma. Ideally, I would like to extract the table in rank order across all 318 pages; https://botrank.pastimes.eu/?sort=rank&page=1 is the first page.
The code I tried was:
library(rvest)
library(httr)
library(tidyverse)

pages <- seq(1:318)
bots <- lapply(pages, function(i){
  url <- paste0("https://botrank.pastimes.eu/?sort=rank&page=", i)
  webpage <- url %>%
    httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
    read_html()
  data <- webpage %>%
    html_node("table") %>%
    html_table() %>%
    as_tibble()
  colnames(data) = data[1,]
})
bots_table <- do.call(rbind, bots)
head(bots_table, n = 10)
This gives me a good, clean tibble, but with only the first row of each page. Here is the output:
# A tibble: 318 × 7
Rank `Bot Name` Score Good Bo…¹ Bad B…² Comme…³ Link …⁴
<chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 1 KickOpenTheDoorBot 0.993 20,877 119 38,594 98,297
2 251 NinNinBot 0.921 45 0 47 1
3 501 RegularEality 0.859 99 8 0 0
4 751 BillyCloneasaurus 0.806 16 0 267,779 9,350
5 1,001 MamataBot 0.758 12 0 357 5
6 1,251 slashy_potato_mashy 0.703 33 6 0 0
7 1,501 jimmy-b-bot 0.667 45 12 14,531 151
8 1,751 related_threads 0.616 23 6 1,727 1
9 2,001 RemoveMeNot 0.567 15 4 13,595 2
10 2,251 python_boti 0.552 10 2 0 0
# … with 308 more rows, and abbreviated variable names
The website's source code seems standard, so I'm not sure why this is happening. I am also fairly new to web scraping. Any suggestions would be great!
<table >
<tr>
<th>
<div style="margin: 1px" ></div>
<a href="/?sort=rank&order=reverse">Rank</a></th>
<th>
<a href="/?sort=name">Bot Name</a></th>
<th>
<a href="/?sort=score">Score</a></th>
<th><a href="/?sort=good-votes">Good Bot Votes</a></th>
<th>
<a href="/?sort=bad-votes">Bad Bot Votes</a></th>
<th>
<a href="/?sort=comment-karma">Comment Karma</a></th>
<th>
<a href="/?sort=link-karma">Link Karma</a></th>
</tr>
<tr>
<td>1</td>
<td><a href= https://www.reddit.com/user/KickOpenTheDoorBot>KickOpenTheDoorBot</a></td>
<td>0.9932</td>
<td>20,877</td>
<td>119</td>
<td>38,594</td>
<td>98,297</td>
</tr>
<tr>
<td>2</td>
<td><a href= https://www.reddit.com/user/Canna_Tips>Canna_Tips</a></td>
<td>0.992</td>
<td>18,045</td>
<td>121</td>
<td>49,670</td>
<td>1</td>
</tr>
CodePudding user response:
The following works. The main differences are to use html_elements instead of html_node, and to end the anonymous function with data itself, so the full tibble is returned rather than the value of the colnames assignment.
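That second change explains the symptom: in R, an assignment expression evaluates to its right-hand side, so the original function's last line, colnames(data) = data[1,], returned data[1,], i.e. just the first row of each page. Here is a minimal sketch of that pitfall (hypothetical function f and toy data, not the scraped table):

f <- function() {
  data <- data.frame(a = 1:3, b = 4:6)
  # As the function's last expression, this assignment is also its
  # return value, and an assignment evaluates to its right-hand side:
  # here data[1, ], a single row.
  colnames(data) = data[1, ]
}
nrow(f())  # 1, not 3: each call hands back one row, as in the question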
suppressPackageStartupMessages({
  library(rvest)
  library(httr)
  library(tidyverse)
})

pages <- 1:318
bots <- lapply(pages, function(i){
  url <- paste0("https://botrank.pastimes.eu/?sort=rank&page=", i)
  webpage <- url %>%
    httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
    read_html()
  # html_elements() returns all matching nodes, so html_table() yields a
  # list of tibbles; unlist(recursive = FALSE) flattens that into a list
  # of columns, which as_tibble() reassembles into one tibble
  data <- webpage %>%
    html_elements("table") %>%
    html_table() %>%
    unlist(recursive = FALSE) %>%
    as_tibble()
  # return the tibble itself, not the value of an assignment
  data
})
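On a single page you can see the difference between the two selectors directly (a quick sketch reusing page 1 from the question; the page has only one table, so what changes is the return type):

pg <- "https://botrank.pastimes.eu/?sort=rank&page=1" %>%
  httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
  read_html()
pg %>% html_node("table") %>% html_table()      # a single tibble (first match)
pg %>% html_elements("table") %>% html_table()  # a list of tibbles, one per match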
# Sanity checks: one list element per page, each with that page's table dimensions
length(bots)
sapply(bots, dim)
Then rbind them together:
bots_table <- do.call(rbind, bots)
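Since the tidyverse is already attached, dplyr::bind_rows() is an equivalent way to stack the list of tibbles (a stylistic alternative, not part of the fix itself):

bots_table <- dplyr::bind_rows(bots)
head(bots_table, n = 10)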