Attempting to scrape an "unscrapable" page?


I'm attempting to build a simple scraper, iterating through a website to pull two pieces of information and build myself a little reference list.

This is what the url looks like: "https://www.mtgstocks.com/prints/[[n]]"

The two pieces of information are the card name (Forbidden Alchemy) and card set (Innistrad).

Pretty straightforward, yeah? I thought so.

I tried passing every relevant selector (CSS or XPath) to isolate the two variables, but each attempt returned `{xml_nodeset (0)}`.

Here's the code that I ran:

library(httr)
library(rvest)

# fetch and parse the page
page_html <- read_html(GET("https://www.mtgstocks.com/prints/1"))

# extract item name
page_html %>% 
  html_nodes("h3") %>%
  html_nodes("a") %>% 
  html_text()

# character(0)

I've scraped enough webpages to know that this information is being hidden, but I'm not exactly sure how. Would love help!
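One way to confirm what is going on before digging further is to check whether the card name appears anywhere in the raw HTML the server sends back. If it doesn't, the page is being populated by JavaScript after load, which `read_html()` cannot execute. A minimal check (the card name used here is just the known value for print 1):

```r
library(httr)

# Fetch the raw response body exactly as served (no JavaScript executed)
res <- GET("https://www.mtgstocks.com/prints/1")
raw_html <- content(res, as = "text", encoding = "UTF-8")

# If this is FALSE, the card name is injected client-side after page load
grepl("Forbidden Alchemy", raw_html, fixed = TRUE)
```

On a JavaScript-rendered page this check typically returns `FALSE`, which explains the empty node sets above.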

CodePudding user response:

This gets the first component (the card name). I save a copy of the rendered page from the browser first, which also avoids repeatedly downloading it while debugging.

library(tidyverse)
library(rvest)

# pick a locally saved copy of the page
filename <- file.choose()

page_html <- read_html(filename)

page_html %>% 
  html_nodes("h3") %>%
  html_nodes("a") %>% 
  html_text()
# [1] "Forbidden Alchemy"

CodePudding user response:

The first step in scraping web data should be to check where the page pulls its data from, and whether that source can be accessed directly instead of parsing the HTML. In this case it can:

res <- jsonlite::fromJSON("https://api.mtgstocks.com/prints/1")

res$name
# [1] "Forbidden Alchemy"
res$card_set$name
# [1] "Innistrad"