I am currently trying to scrape a website with a combination of RSelenium, rvest, and the tidyverse.
The goal is to go to this website, click on one of the links (for instance, "Promo"), and then extract the entire table of data (e.g., card and graded prices) using rvest.
I was able to get the table extracted without too much of an issue using the following code:
library(RSelenium)
library(rvest)
library(tidyverse)
pokemon <- read_html("https://www.pricecharting.com/console/pokemon-promo")
price_table <- pokemon %>%
  html_elements("#games_table") %>%
  html_table()
However, this has a couple of issues: 1) I cannot go through all the different card sets from the initial website link I provided (https://www.pricecharting.com/category/pokemon-cards), and 2) I cannot extract the entire table with this method - only the rows that are loaded initially.
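(As an aside, the first issue can in principle be handled with rvest alone by collecting the set links from the category page. A minimal sketch - the assumption here, which you would need to verify in the page source, is that the set links are ordinary anchor tags whose href contains "/console/":

library(rvest)
library(stringr)

category <- read_html("https://www.pricecharting.com/category/pokemon-cards")

## Collect every link on the page, then keep the ones that appear to
## point at a card-set page. The "/console/" pattern is an assumption
## about the site's URL scheme, inferred from the Promo set's URL.
set_links <- category %>%
  html_elements("a") %>%
  html_attr("href") %>%
  str_subset("/console/") %>%
  unique()

Each resulting URL could then be fed to read_html() directly, though this still does not solve the lazy-loading problem.)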
To mitigate these issues I was looking into RSelenium. What I decided to do was go to the initial website, click on the link to a card set (e.g., "Promo"), and then load the entire page. This workflow is shown here:
## open driver
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
## navigate to primary page
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
## click on the link I want
remDr$findElement(using = "link text", "Promo")$clickElement()
## find the table
table <- remDr$findElement(using = "id", "games_table")
## load the entire table
table$sendKeysToElement(list(key = "end"))
## get the entire source
full_table <- remDr$getPageSource()[[1]]
## read in the table
html_page <- read_html(full_table)
## Do the `rvest` technique I had above.
html_page %>%
  html_elements("#games_table") %>%
  html_table()
However, my issue is that I am once again getting the same 51 rows instead of the entire table.
I am wondering if it is possible to combine my two techniques, and where in my code this is going wrong.
CodePudding user response:
I solved this issue. There were two things going on. The first is that the page loads with the cursor inside a search bar. I got rid of this with remDr$findElement(using = "css", "body")$clickElement() to click into the body of the page. Next, as one great question/answer pointed out, if scrolling with the arrow keys via sendKeysToElement(list(key = "up_arrow")) is not working, you should try remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);").
Hence, a small sample of my script is the following:
library(RSelenium)
library(rvest)
library(tidyverse)
## opens the driver
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
link_texts <- c("Base Set", "Promo", "Fossil")
## navigates to the correct page
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
for (name in link_texts) {
  ## finds the link and clicks on it
  remDr$findElement(using = "link text", name)$clickElement()
  ## clicks out of the search bar and into the body of the page
  remDr$findElement(using = "css", "body")$clickElement()
  ## finds the table - this line may be extraneous
  table <- remDr$findElement(using = "css", "body")
  ## scrolls to the bottom of the table repeatedly, pausing so new rows can load
  for (i in 1:6) {
    remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);")
    Sys.sleep(1)
  }
  ## get the entire page source that's been loaded
  html <- remDr$getPageSource()[[1]]
  ## read in the page source
  page <- read_html(html)
  data_name <- str_to_lower(str_replace(name, " ", "_"))
  ## extract the table, keeping only the four data columns
  df <- page %>%
    html_elements("#games_table") %>%
    html_table() %>%
    pluck(1) %>%
    select(1:4)
  assign(data_name, df)
  Sys.sleep(3)
  remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
}
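A side note on the assign() pattern: accumulating the data frames in a named list keeps the results together and avoids creating variables programmatically. A sketch of the same loop with only that part changed (the elided steps are the navigate/click/scroll code above):

results <- list()
for (name in link_texts) {
  ## ... navigate, click, scroll, and read the page into `page` as above ...
  df <- page %>%
    html_elements("#games_table") %>%
    html_table() %>%
    pluck(1) %>%
    select(1:4)
  ## store under a snake_case name, e.g. results$base_set
  results[[str_to_lower(str_replace_all(name, " ", "_"))]] <- df
}
## all sets in one long table, with a column identifying the set:
all_sets <- bind_rows(results, .id = "set")

(str_replace_all is used here rather than str_replace so that set names with more than one space are also converted.)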
## close driver
remDr$close()
rD$server$stop()
CodePudding user response:
The page wasn't scrolling down because the cursor is in the search bar by default. So I made a small modification to your code so it scrolls down completely.
#Launch browser
rD <- rsDriver(browser="firefox", port=9545L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
remDr$findElement(using = "link text", "Promo")$clickElement()
#clicking outside the search bar
remDr$findElement(using = "xpath", value = '//*[@id="console-page"]')$clickElement()
webElem <- remDr$findElement("css", "body")
#looping to get at the end of the page.
for (i in 1:25){
  Sys.sleep(1)
  webElem$sendKeysToElement(list(key = "end"))
}
#extract table
full_table <- remDr$getPageSource()[[1]]
html_page <- read_html(full_table)
html_page %>%
  html_elements("#games_table") %>%
  html_table()
[[1]]
# A tibble: 888 x 5
Card Ungraded `Grade 9` `PSA 10` ``
<chr> <chr> <chr> <chr> <chr>
1 Mew #8 $3.99 $38.79 $75.62 " Collection\n In One Click\n ~
2 Mewtwo #3 $8.28 $65.91 $227.50 " Collection\n In One Click\n ~
3 Charizard GX #SM211 $7.85 $23.64 $53.50 " Collection\n In One Click\n ~
4 Charizard V #SWSH050 $8.00 $34.99 $79.98 " Collection\n In One Click\n ~
5 Pikachu #24 $138.31 $362.72 $2,919.69 " Collection\n In One Click\n ~
6 Entei #34 $8.50 $52.21 $153.63 " Collection\n In One Click\n ~
7 Ancient Mew $23.79 $99.99 $382.50 " Collection\n In One Click\n ~
8 Charizard EX #XY121 $27.16 $135.00 $727.00 " Collection\n In One Click\n ~
9 Mewtwo EX #XY107 $5.54 $77.50 $98.71 " Collection\n In One Click\n ~
10 Charizard GX #SM60 $28.57 $113.98 $492.00 " Collection\n In One Click\n ~
# ... with 878 more rows
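The trailing unnamed column in the output above is residue from the page's "Collection In One Click" button, and the prices come through as character strings. A small cleanup sketch, assuming the column names shown in the printed tibble:

library(tidyverse)

clean_table <- html_page %>%
  html_elements("#games_table") %>%
  html_table() %>%
  pluck(1) %>%
  ## drop the button-residue column by keeping only the named price columns
  select(Card, Ungraded, `Grade 9`, `PSA 10`) %>%
  ## strip "$" and "," and convert the prices to numbers
  mutate(across(c(Ungraded, `Grade 9`, `PSA 10`), readr::parse_number))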