Optimize web scraping with RSelenium

I am scraping a dynamic webpage and would like to speed the process up, since it is very slow. The page displays a list of sales with their details, and more sales load as you scroll down, although the total number is finite. My workaround was to enlarge the browser window so that almost every sale loads without scrolling. However, the page then takes a long time to load because there is a lot of information, including images. The information I am extracting is the price, the asset name, and the link associated with each asset (the one you follow when you click on its image).

My goal is to optimize this process as much as possible. One way to do so would be to skip loading the images, since I don't need them, but I could not find a way to do that with Firefox.

Any improvement would be greatly appreciated.

library(RSelenium)
library(rvest)

url <- "https://cnft.io/marketplace?project=Boss Cat Rocket Club&sort=_id:-1&type=listing,offer"

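# Run Firefox headless so that no browser window is shown
exCap <- list("moz:firefoxOptions" = list(args = list('--headless')))
rD <- rsDriver(browser = "firefox", port = as.integer(sample(4000:4700, 1)),
               verbose = FALSE, extraCapabilities = exCap)
remDr <- rD[["client"]]
remDr$setWindowSize(30000, 30000) # Huge window so most sales load without scrolling
remDr$navigate(url)
Sys.sleep(300)                    # Wait 5 minutes for the page to finish loading
html <- remDr$getPageSource()[[1]]
remDr$close()
rD$server$stop()                  # Also stop the Selenium server started by rsDriver()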

html <- read_html(html)
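
On the image-loading point: one option that may help is to pass a Firefox preference through the same `moz:firefoxOptions` capability used above. The sketch below assumes that geckodriver accepts a `prefs` entry there and that Firefox still honours `permissions.default.image` (where 2 means "block all images"). `start_firefox_no_images` and the port number are hypothetical names chosen for illustration, not part of RSelenium.

```r
# Capabilities: headless Firefox with image loading switched off.
# permissions.default.image: 1 = load images, 2 = block all images.
exCap <- list(
  "moz:firefoxOptions" = list(
    args  = list("--headless"),
    prefs = list("permissions.default.image" = 2L)
  )
)

# Hypothetical helper: the driver is started only when this is called,
# so the capability list above can be inspected without a browser.
# Assumes RSelenium, Firefox, and geckodriver are available locally.
start_firefox_no_images <- function(port = 4567L) {
  RSelenium::rsDriver(browser = "firefox", port = port,
                      verbose = FALSE, extraCapabilities = exCap)
}
```

Calling `rD <- start_firefox_no_images()` and then `remDr <- rD[["client"]]` would slot into the script above in place of the original `rsDriver()` call. Whether this shortens the load time noticeably depends on how much of the delay is due to images.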

CodePudding user response:

After some digging through that website, I found an API that serves all the listings: https://api.cnft.io/market/listings. It accepts POST requests and returns paginated JSON. We can use httr to send such requests. Here is a small script for your web scraping task.

api_link <- "https://api.cnft.io/market/listings"
project <- "Boss Cat Rocket Club"

# Send one POST request and return the parsed JSON for the given page
query <- function(page, url, project) {
  httr::content(httr::POST(
    url = url, 
    body = list(
      search = "", 
      types = c("listing", "offer"), 
      project = project, 
      sort = list(`_id` = -1L), 
      priceMin = NULL, 
      priceMax = NULL, 
      page = page, 
      verified = TRUE, 
      nsfw = FALSE, 
      sold = FALSE, 
      smartContract = FALSE
    ), 
    encode = "json"
  ), simplifyVector = TRUE)
}

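# Fetch pages one at a time until the API returns an empty page
query_all <- function(url, project) {
  # `count` from the first response serves only as an upper bound on the page number
  n <- query(1L, url, project)[["count"]]
  out <- vector("list", n)
  for (i in seq_len(n)) {
    out[[i]] <- query(i, url, project)[["results"]]
    # Stop at the first empty page and drop it from the output
    if (length(out[[i]]) < 1L)
      return(out[seq_len(i - 1L)])
  }
  out
}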

# Keep only the asset id, price, and token link from one page of results
collect_data <- function(results) {
  dplyr::tibble(
    asset_id = results[["asset"]][["assetId"]],
    price = results[["price"]],
    link = paste0("https://cnft.io/token/", results[["_id"]])
  )
}

system.time(
  dt <- query_all(api_link, project) |> lapply(collect_data) |> dplyr::bind_rows()  
)
dt

Output (it takes about 12 seconds to finish)

> system.time(
    dt <- query_all(api_link, project) |> lapply(collect_data) |> dplyr::bind_rows()  
  )
   user  system elapsed 
   0.78    0.00   12.33 
> dt
# A tibble: 2,161 x 3
   asset_id                     price link                                          
   <chr>                        <dbl> <chr>                                         
 1 BossCatRocketClub1373    222000000 https://cnft.io/token/61ce22eb4185f57d50190079
 2 BossCatRocketClub4639    380000000 https://cnft.io/token/61ce229b9163f2db80db98fe
 3 BossCatRocketClub5598    505000000 https://cnft.io/token/61ce22954185f57d5018e2ff
 4 BossCatRocketClub2673    187000000 https://cnft.io/token/61ce2281ceed93ea12ae32ec
 5 BossCatRocketClub1721    350000000 https://cnft.io/token/61ce2281398627cc52c5844c
 6 BossCatRocketClub673     300000000 https://cnft.io/token/61ce22724185f57d5018d645
 7 BossCatRocketClub5915 200000000000 https://cnft.io/token/61ce2241398627cc52c56eae
 8 BossCatRocketClub5699    350000000 https://cnft.io/token/61ce21fa398627cc52c55644
 9 BossCatRocketClub4570    350000000 https://cnft.io/token/61ce21ef4185f57d5018a9d4
10 BossCatRocketClub6125    250000000 https://cnft.io/token/61ce21e49163f2db80db58dd
# ... with 2,151 more rows
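
The paging loop in `query_all` stops at the first empty page. The same idea can be sketched without relying on the `count` field, and in a form that can be tried offline. In the sketch below, `collect_pages`, `fetch_page`, `max_pages`, and `fake_fetch` are hypothetical names introduced for illustration; `fake_fetch` stands in for the real API.

```r
# Collect pages until `fetch_page(i)` returns an empty result.
# `fetch_page` is a hypothetical function: page number in, results out.
# `max_pages` is an arbitrary safety cap, not a value taken from the API.
collect_pages <- function(fetch_page, max_pages = 1000L) {
  out <- list()
  for (i in seq_len(max_pages)) {
    res <- fetch_page(i)
    if (length(res) < 1L) break   # empty page: nothing left to fetch
    out[[i]] <- res
  }
  out
}

# Offline stand-in for the API: three pages of fake results, then empty pages.
fake_fetch <- function(i) {
  if (i <= 3L) data.frame(id = i, price = i * 100) else list()
}

pages <- collect_pages(fake_fetch)
length(pages)
```

Against the real endpoint, something like `collect_pages(function(i) query(i, api_link, project)[["results"]])` should behave like `query_all` while requesting the first page only once.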