Scraping a book table from Goodreads

Time:01-15

I'm trying to scrape a table of read books from the Goodreads website using rvest. The data is formatted as a table, but I get errors when I try to extract it.

First, load the packages and set the URL to scrape:

library(dplyr)
library(rvest)

url <- "https://www.goodreads.com/review/list/4622890?shelf=read"

Running this code:

dat <- read_html(url) %>% 
  html_nodes('//*[@id="booksBody"]') %>% 
  html_table()

Produces: Error in tokenize(css) : Unexpected character '/' found at position 1

Trying it again, but without the first /:

dat <- read_html(url) %>% 
  html_nodes('/*[@id="booksBody"]') %>% 
  html_table()

Produces: Error in parse_simple_selector(stream) : Expected selector, got <EOF at 20>

And finally, just trying to get the table directly, without the intermediate call to html_nodes:

dat <- read_html(url) %>% 
  html_table('/*[@id="booksBody"]')

Produces: Error in if (header) { : argument is not interpretable as logical

I would appreciate any guidance on how to scrape this table.

CodePudding user response:

I can get the first 30 books with this:

library(dplyr)
library(rvest)

url <- "https://www.goodreads.com/review/list/4622890?shelf=read"

book_table <- read_html(url) %>% 
  html_elements('table#books') %>%
  html_table() %>%
  .[[1]]

book_table

The captured data may need some cleaning. To get the complete list, I am afraid rvest alone may not be enough; you might need something like RSelenium to scroll through the list.
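
As for the errors in the question: the first positional argument of html_nodes() / html_elements() is a CSS selector, so an XPath expression has to be passed by name with xpath =. Likewise, the second argument of html_table() is header, not a selector, which explains the "not interpretable as logical" message. Below is a minimal offline sketch of the difference. The markup is a hypothetical stand-in built with minimal_html(), not Goodreads' real HTML; only the ids books and booksBody are taken from the posts above.

```r
library(rvest)

# Tiny stand-in page (hypothetical markup, NOT the real Goodreads HTML)
page <- minimal_html('
  <table id="books">
    <thead><tr><th>title</th><th>author</th></tr></thead>
    <tbody id="booksBody">
      <tr><td>Sunset</td><td>Cave, Jessie</td></tr>
    </tbody>
  </table>')

# XPath must be named; an unnamed string is parsed as a CSS selector
by_xpath <- page %>%
  html_elements(xpath = '//*[@id="books"]') %>%
  html_table()

# The equivalent CSS selector can be passed positionally
by_css <- page %>%
  html_elements("table#books") %>%
  html_table()

# Both calls return a list with one data frame per matched table
by_css[[1]]
```

Selecting the table element itself (rather than its tbody) keeps the header row, so html_table() can name the columns.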

CodePudding user response:

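Scraping several pages by looping over the page parameter (the call below makes six requests, for page values 0 through 5, with 30 books each):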

library(tidyverse)
library(rvest)

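# Scrape one page of the shelf and return one row per book
get_books <- function(page) {
  cat("Scraping page:", page, "\n")
  # Each book is a table row with the classes "bookalike" and "review"
  books <-
    str_c("https://www.goodreads.com/review/list/4622890-emily-may?page=", page,
          "&shelf=#ALL#") %>%
    read_html() %>%
    html_elements(".bookalike.review")
  
  # Pull the individual fields out of the book rows
  tibble(
    title = books %>%
      html_elements(".title a") %>%
      html_text2(),
    author = books %>%
      html_elements(".author a") %>%
      html_text2(),
    # Average rating on the site, converted from text to a number
    rating = books %>%
      html_elements(".avg_rating .value") %>%
      html_text2() %>%
      as.numeric(),
    # Date added, parsed from month-day-year text
    date = books %>%
      html_elements(".date_added .value") %>%
      html_text2() %>%
      lubridate::mdy()
  )
}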

df <- map_dfr(0:5, get_books)

# A tibble: 180 x 4
   title                                 author rating date      
   <chr>                                 <chr>   <dbl> <date>    
 1 Sunset                                "Cave~   4.19 2023-01-14
 2 Green for Danger (Inspector Cockrill~ "Bran~   3.84 2023-01-12
 3 Stone Cold Fox                        "Crof~   4.22 2023-01-12
 4 What If I'm Not a Cat?                "Wint~   4.52 2023-01-10
 5 The Prisoner's Throne (The Stolen He~ "Blac~   4.85 2023-01-07
 6 The Kind Worth Saving (Henry Kimball~ "Swan~   4.13 2023-01-06
 7 Girl at War                           "Novi~   4    2022-12-29
 8 If We Were Villains                   "Rio,~   4.23 2022-12-29
 9 The Gone World                        "Swet~   3.94 2022-12-28
10 Batman: The Dark Knight Returns       "Mill~   4.26 2022-12-28
# ... with 170 more rows
# i Use `print(n = ...)` to see more rows
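
If the number of pages is not known in advance, one option is to keep requesting pages until one comes back empty. The sketch below is only an illustration: scrape_until_empty() and fake_get_books() are hypothetical names, and it assumes the page function returns a zero-row tibble once the page number runs past the end of the shelf, which I have not verified for get_books() against the live site.

```r
library(dplyr)

# Hypothetical helper: call get_page(1), get_page(2), ... and stop at the
# first page with no rows (or after max_pages requests)
scrape_until_empty <- function(get_page, max_pages = 50, pause = 0) {
  pages <- list()
  for (page in seq_len(max_pages)) {
    result <- get_page(page)
    if (nrow(result) == 0) break
    pages[[page]] <- result
    Sys.sleep(pause)  # optional delay between requests
  }
  bind_rows(pages)
}

# Offline stand-in for get_books(): three pages of two books, then nothing
fake_get_books <- function(page) {
  if (page > 3) return(tibble(title = character()))
  tibble(title = paste("Book", page, 1:2))
}

all_books <- scrape_until_empty(fake_get_books)
nrow(all_books)
```

With the real function, the call would look like scrape_until_empty(get_books, pause = 1); the max_pages cap is there so a page that never comes back empty cannot loop forever.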