Scraping Website with Unchanging URL in R

Time:10-10

I would like to scrape a series of tables from a website whose URL does not change when I click through the tables in my browser. Each table corresponds to a unique date. The default table is that which corresponds to today's date. I can scroll through past dates in my browser, but can't seem to find a way to do so in R.

Using library(rvest), this bit of code reliably downloads the table that corresponds to today's date (I'm only interested in the first of the three tables).

library(rvest)  # also provides the %>% pipe used below

webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
  read_html()  %>%
  html_table()
off <- off[[1]]

How can I download the table that corresponds to, say "2022-10-04", to "2022-10-06", or to yesterday?

I've tried to work through it by identifying the node under which the table lies, in the hopes that I could manipulate it to reflect a prior date. However, the following reproduces the same table as above:

webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
  read_html() %>%
  html_nodes("#main > div > section:nth-child(1) > article > div > div.dayContent > div > table") %>%
  html_table()
off <- off[[1]]

Scrolling through past dates in my browser, I've identified various places in the HTML that reference the prior date, but I can't seem to change it from R, let alone get the table I download to reflect a change:

webad %>%
  read_html() %>%
  html_nodes("#main > div > section:nth-child(1) > article > header > div")

I've messed around some with html_form(), follow_link(), and set_values() also, but to no avail.

Is there a good way to navigate this particular URL in R?

CodePudding user response:

One approach is to drive a real browser with RSelenium and use the page's own date filter:

library(RSelenium)
library(rvest)

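# Pick a random port above 4444 to avoid clashing with an earlier session
port <- as.integer(4444L + rpois(lambda = 1000, 1))
# chromever must match a chromedriver version compatible with the installed Chrome
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)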
remDr <- rd$client

url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)

web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()

web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()

web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()

web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()

html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()

[[1]]
# A tibble: 5 x 5
  Game                     `Official 1`            `Official 2`         `Official 3`          Alternate
  <chr>                    <chr>                   <chr>                <chr>                 <lgl>    
1 Indiana @ Charlotte      John Goble (#10)        Lauren Holtkamp (#7) Phenizee Ransom (#70) NA       
2 Cleveland @ Philadelphia Marc Davis (#8)         Jacyn Goble (#68)    Tyler Mirkovich (#97) NA       
3 Toronto @ Boston         Josh Tiven (#58)        Matt Boland (#18)    Intae hwang (#96)     NA       
4 Dallas @ Oklahoma City   Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91)   NA       
5 Phoenix @ L.A. Lakers    Bill Kennedy (#55)      Rodney Mott (#71)    Jenna Reneau (#93)    NA       

[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names

[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names

[[4]]
# A tibble: 6 x 7
      S     M     T     W     T     F     S
  <int> <int> <int> <int> <int> <int> <int>
1    NA    NA    NA    NA    NA    NA     1
2     2     3     4     5     6     7     8
3     9    10    11    12    13    14    15
4    16    17    18    19    20    21    22
5    23    24    25    26    27    28    29
6    30    31    NA    NA    NA    NA    NA
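To cover the range of dates asked about in the question, the steps above could be wrapped in a function and applied to each date. This is only a sketch: `get_assignments()` and its `wait` argument are hypothetical names, the fixed pause before reading the page source is an assumption about how long the table takes to refresh, and nothing here has been checked against the live site.

```r
# Hypothetical helper: repeats the steps from the answer above for one date.
# `remDr` is the RSelenium client created earlier; `date` is a "YYYY-MM-DD" string.
get_assignments <- function(remDr, date, wait = 2) {
  filter_css <- "#ref-filters-menu > li > div > button"

  # Open the date filter and type the requested date
  remDr$findElement("css selector", filter_css)$clickElement()
  date_input <- remDr$findElement("id", "ref-date")
  date_input$clearElement()
  date_input$sendKeysToElement(list(date))
  date_input$doubleclick()

  # Click the filter button again, then submit the form, as in the answer
  remDr$findElement("css selector", filter_css)$clickElement()
  remDr$findElement("css selector", "#date-filter")$submitElement()

  # Assumption: a short fixed pause is enough for the table to refresh
  Sys.sleep(wait)

  tables <- rvest::html_table(rvest::read_html(remDr$getPageSource()[[1]]))
  tables[[1]]  # the first table holds the assignments in the output above
}

# Dates from the question, formatted as "YYYY-MM-DD" strings
dates <- format(seq(as.Date("2022-10-04"), as.Date("2022-10-06"), by = "day"))
yesterday <- format(Sys.Date() - 1)

# With a live session (not run here):
# results <- lapply(dates, function(d) get_assignments(remDr, d))
# names(results) <- dates
```

Each element of `results` would then be the table for one date, so `results[["2022-10-05"]]` would correspond to the first table printed above.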

CodePudding user response:

Here is another approach, using RDCOMClient to control Internet Explorer through COM (Windows only):

library(RDCOMClient)
library(rvest)

url <- "https://official.nba.com/referee-assignments/"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()

clickEvent <- doc$createEvent("MouseEvent")
clickEvent$initEvent("click", TRUE, FALSE)

web_Obj_Date <- doc$querySelector("#ref-filters-menu > li > div > button")
web_Obj_Date$dispatchEvent(clickEvent)

web_Obj_Date_Input <- doc$GetElementById('ref-date')
web_Obj_Date_Input[["Value"]] <- "2022-10-05"

web_Obj_Go_Button <- doc$querySelector("#date-filter")
web_Obj_Go_Button$dispatchEvent(clickEvent)

html_Content <- doc$Body()$innerHTML()
read_html(html_Content) %>% html_table()

[[1]]
# A tibble: 5 x 5
  Game                     `Official 1`            `Official 2`         `Official 3`          Alternate
  <chr>                    <chr>                   <chr>                <chr>                 <lgl>    
1 Indiana @ Charlotte      John Goble (#10)        Lauren Holtkamp (#7) Phenizee Ransom (#70) NA       
2 Cleveland @ Philadelphia Marc Davis (#8)         Jacyn Goble (#68)    Tyler Mirkovich (#97) NA       
3 Toronto @ Boston         Josh Tiven (#58)        Matt Boland (#18)    Intae hwang (#96)     NA       
4 Dallas @ Oklahoma City   Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91)   NA       
5 Phoenix @ L.A. Lakers    Bill Kennedy (#55)      Rodney Mott (#71)    Jenna Reneau (#93)    NA       

[[2]]
# A tibble: 8 x 7
  Game   `Official 1` `Official 2` `Official 3` Alternate   ``    ``   
  <chr>  <chr>        <chr>        <chr>        <chr>       <chr> <chr>
1 "Game" "Official 1" "Official 2" "Official 3" "Alternate"  NA    NA  
2 "S"    "M"          "T"          "W"          "T"         "F"   "S"  
3 ""     ""           ""           ""           ""          ""    "1"  
4 "2"    "3"          "4"          "5"          "6"         "7"   "8"  
5 "9"    "10"         "11"         "12"         "13"        "14"  "15" 
6 "16"   "17"         "18"         "19"         "20"        "21"  "22" 
7 "23"   "24"         "25"         "26"         "27"        "28"  "29" 
8 "30"   "31"         ""           ""           ""          ""    ""   

[[3]]
# A tibble: 7 x 7
  Game  `Official 1` `Official 2` `Official 3` Alternate ``    ``   
  <chr> <chr>        <chr>        <chr>        <chr>     <chr> <chr>
1 "S"   "M"          "T"          "W"          "T"       "F"   "S"  
2 ""    ""           ""           ""           ""        ""    "1"  
3 "2"   "3"          "4"          "5"          "6"       "7"   "8"  
4 "9"   "10"         "11"         "12"         "13"      "14"  "15" 
5 "16"  "17"         "18"         "19"         "20"      "21"  "22" 
6 "23"  "24"         "25"         "26"         "27"      "28"  "29" 
7 "30"  "31"         ""           ""           ""        ""    ""   

[[4]]
# A tibble: 6 x 7
      S     M     T     W     T     F     S
  <int> <int> <int> <int> <int> <int> <int>
1    NA    NA    NA    NA    NA    NA     1
2     2     3     4     5     6     7     8
3     9    10    11    12    13    14    15
4    16    17    18    19    20    21    22
5    23    24    25    26    27    28    29
6    30    31    NA    NA    NA    NA    NA
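Both approaches return a list that mixes the assignments table with empty tables and calendar widgets, and the list differs between the two outputs above. A small filter can pick out the table of interest instead of relying on its position. This is a sketch in base R; `pick_officials()` is a hypothetical helper, and the rule it uses (first column named `Game`, at least one value containing " @ ") is an assumption based only on the output shown here.

```r
# Hypothetical helper: keep only tables that look like the assignments table,
# i.e. the first column is "Game" and some row reads like "Away @ Home".
pick_officials <- function(tables) {
  is_officials <- function(tb) {
    ncol(tb) > 0 &&
      identical(names(tb)[1], "Game") &&
      nrow(tb) > 0 &&
      any(grepl(" @ ", tb[[1]], fixed = TRUE))
  }
  Filter(is_officials, tables)
}
```

With either answer, `pick_officials(read_html(html_Content) %>% html_table())` should then return a list holding just the table with the game rows, since the empty tables and the calendar tables contain no " @ " entries.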