I'm attempting to scrape a table of read books from the Goodreads website using rvest. The data is formatted as a table, but I am getting errors when attempting to extract it.
First we load some packages and set the URL to scrape:
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
Running this code:
dat <- read_html(url) %>%
  html_nodes('//*[@id="booksBody"]') %>%
  html_table()
Produces: Error in tokenize(css) : Unexpected character '/' found at position 1
Trying it again, but without the first /:
dat <- read_html(url) %>%
  html_nodes('/*[@id="booksBody"]') %>%
  html_table()
Produces: Error in parse_simple_selector(stream) : Expected selector, got <EOF at 20>
And finally, just trying to get the table directly, without the intermediate call to html_nodes:
dat <- read_html(url) %>%
  html_table('/*[@id="booksBody"]')
Produces: Error in if (header) { : argument is not interpretable as logical
I would appreciate any guidance on how to scrape this table.
CodePudding user response:
I can get the first 30 books using this:
library(dplyr)
library(rvest)
url <- "https://www.goodreads.com/review/list/4622890?shelf=read"
book_table <- read_html(url) %>%
  html_elements('table#books') %>%
  html_table() %>%
  .[[1]]
book_table
There is some cleaning that you might need to do in the data captured. Moreover, to get the complete list, I am afraid rvest would not be enough. You might need to use something like RSelenium to scroll through the list.
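As for the errors in the question: html_nodes() and html_elements() treat their first positional argument as a CSS selector, so an XPath expression has to be passed by name as xpath =. Likewise, the second argument of html_table() is header, not a selector, which is why passing a string there fails with "argument is not interpretable as logical". Below is a minimal offline sketch of both selector styles. It assumes rvest 1.0 or later, and the inline HTML is only a made-up stand-in for the Goodreads page (the id "books" is taken from the selector above, "booksBody" from the question).

```r
library(rvest)

# Made-up stand-in for the real page: a <table id="books"> that wraps
# a <tbody id="booksBody">. The actual Goodreads markup may differ.
page <- minimal_html('
  <table id="books">
    <thead><tr><th>title</th><th>rating</th></tr></thead>
    <tbody id="booksBody">
      <tr><td>Book A</td><td>4.19</td></tr>
      <tr><td>Book B</td><td>3.84</td></tr>
    </tbody>
  </table>')

# XPath has to be passed by name; the first positional argument is `css`.
by_xpath <- page %>%
  html_elements(xpath = '//*[@id="books"]') %>%
  html_table()

# The equivalent CSS selector can be passed positionally.
by_css <- page %>%
  html_elements("table#books") %>%
  html_table()

# Both calls return a list with one data frame per matched table.
same_result <- identical(by_xpath[[1]], by_css[[1]])
```

Selecting the enclosing table rather than the tbody keeps the header row, so html_table() can name the columns.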
CodePudding user response:
Scraping the first few pages (the call below requests pages 0 through 5, i.e. six requests):
library(tidyverse)
library(rvest)
library(httr2)
get_books <- function(page) {
  cat("Scraping page:", page, "\n")
  books <-
    str_c("https://www.goodreads.com/review/list/4622890-emily-may?page=", page,
          "&shelf=#ALL#") %>%
    read_html() %>%
    html_elements(".bookalike.review")
  tibble(
    title = books %>%
      html_elements(".title a") %>%
      html_text2(),
    author = books %>%
      html_elements(".author a") %>%
      html_text2(),
    rating = books %>%
      html_elements(".avg_rating .value") %>%
      html_text2() %>%
      as.numeric(),
    date = books %>%
      html_elements(".date_added .value") %>%
      html_text2() %>%
      lubridate::mdy()
  )
}
df <- map_dfr(0:5, get_books)
# A tibble: 180 x 4
   title                                 author rating date
   <chr>                                 <chr>   <dbl> <date>
 1 Sunset                                "Cave~   4.19 2023-01-14
 2 Green for Danger (Inspector Cockrill~ "Bran~   3.84 2023-01-12
 3 Stone Cold Fox                        "Crof~   4.22 2023-01-12
 4 What If I'm Not a Cat?                "Wint~   4.52 2023-01-10
 5 The Prisoner's Throne (The Stolen He~ "Blac~   4.85 2023-01-07
 6 The Kind Worth Saving (Henry Kimball~ "Swan~   4.13 2023-01-06
 7 Girl at War                           "Novi~   4    2022-12-29
 8 If We Were Villains                   "Rio,~   4.23 2022-12-29
 9 The Gone World                        "Swet~   3.94 2022-12-28
10 Batman: The Dark Knight Returns       "Mill~   4.26 2022-12-28
# ... with 170 more rows
# i Use `print(n = ...)` to see more rows
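If the number of pages is not known in advance, one option is to keep requesting pages until one comes back empty. The sketch below shows only that loop. The names scrape_until_empty and fake_get_books are hypothetical, the fake page function stands in for get_books() so the example runs offline, and it is an assumption that Goodreads returns a page with no book rows once you go past the last page.

```r
library(dplyr)

# Hypothetical helper: calls get_page(i) for i = 1, 2, ... and stops at the
# first page that yields zero rows, or after max_pages requests.
scrape_until_empty <- function(get_page, max_pages = 100, pause = 1) {
  pages <- list()
  for (i in seq_len(max_pages)) {
    page_df <- get_page(i)
    if (nrow(page_df) == 0) break
    pages[[i]] <- page_df
    Sys.sleep(pause)  # wait between requests to go easy on the server
  }
  bind_rows(pages)
}

# Offline stand-in for get_books(): three pages of two rows each, then empty.
fake_get_books <- function(page) {
  if (page > 3) {
    return(tibble(title = character(), page = integer()))
  }
  tibble(
    title = paste("Book", page, c("a", "b")),
    page  = rep(as.integer(page), 2)
  )
}

all_books <- scrape_until_empty(fake_get_books, pause = 0)
```

With the real scraper this would be called as scrape_until_empty(get_books), with max_pages acting as a safety cap in case the site never returns an empty page.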