How to grab the dates from a webpage in R

I am trying to write R code to get the dates from the date column of the webpage below, e.g. Mar 23, Sat. I viewed the page source and these dates are not present in it.

So far I have tried the following, but nothing works (please excuse me if this code looks silly; I'm new to web scraping):

library(rvest)

webpage <- read_html("https://www.cricbuzz.com/cricket-series/2810/indian-premier-league-2019/matches")
webpage %>% html_nodes(xpath = "//*[@id='series-matches']/div[4]/div[1]") %>% html_text()
#> [1] ""

webpage %>% html_nodes(xpath = "//html/body/div/div[2]/div[4]/div/div[6]/div[2]/span") %>% html_text()
#> [1] ""

webpage %>% html_nodes(xpath = "//html/body/div/div[2]/div[4]/div/div[6]/div[2]/span/ng-binding") %>% html_text()
#> character(0)

webpage %>% html_nodes(".ng-binding") %>% html_text()
#> character(0)

CodePudding user response：

The info is stored in the ng-bind attribute of child span elements whose direct parent has class "schedule-date".

<div class="schedule-date" ng-show="!filter_set"><span ng-bind=" 1553351400000| date:'MMM dd, EEE' : '+05:30'">Mar 23, Sat</span></div>

You can use the CSS selector .schedule-date > span to target these elements and then extract the ng-bind attribute values. Each value gives you a Unix epoch timestamp in milliseconds, followed by the date formatting instructions and the UTC offset (Indian Standard Time).

1553351400000| date:'MMM dd, EEE' : '+05:30'
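To see why html_text() comes back empty while the attribute is still usable, here is a small sketch on an inline snippet rather than the live site. The HTML below is a hypothetical stand-in for the unrendered page source: it assumes the span is empty until Angular fills it in, which matches the empty strings reported in the question.

```r
library(rvest)

# Hypothetical stand-in for the unrendered source: the span has no text yet,
# but its ng-bind attribute already carries the timestamp.
page <- minimal_html(
  '<div class="schedule-date"><span ng-bind=" 1553351400000| date:\'MMM dd, EEE\' : \'+05:30\'"></span></div>'
)

spans <- html_elements(page, ".schedule-date > span")

html_text(spans)             # empty string: nothing has been rendered
html_attr(spans, "ng-bind")  # the raw binding, including the timestamp
```

The selector matches either way; only the choice between html_text() and html_attr() decides whether you get anything back.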

You can extract the timestamp portion and apply the relevant transformations (as informed by the info after the pipe delimiter).

library(rvest)
library(tidyverse)
library(stringi)

strings_with_dates <- read_html("https://www.cricbuzz.com/cricket-series/2810/indian-premier-league-2019/matches") %>%
  html_elements(".schedule-date > span") %>%
  html_attr("ng-bind")

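# The leading run of digits is the epoch timestamp in milliseconds
timestamps_ms <- str_match(strings_with_dates, "(\\d+)")[, 2] %>% as.numeric()

# POSIXct counts seconds, so divide by 1000; format directly in IST
# so the result does not depend on the local time zone
dates <- (timestamps_ms / 1000) %>%
  as.POSIXct(origin = "1970-01-01", tz = "Asia/Kolkata") %>%
  stri_datetime_format(format = "MMM dd, EEE", tz = "Asia/Kolkata")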
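
The conversion can also be checked offline on the single sample value shown earlier. This is a minimal base R sketch, not part of the answer's pipeline; the variable names ng_bind, ms and when are made up for illustration.

```r
# Sample ng-bind value taken from the page
ng_bind <- " 1553351400000| date:'MMM dd, EEE' : '+05:30'"

# First run of digits = epoch time in milliseconds
ms <- as.numeric(regmatches(ng_bind, regexpr("[0-9]+", ng_bind)))

# POSIXct counts seconds since 1970-01-01 UTC
when <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "Asia/Kolkata")

format(when, "%Y-%m-%d %H:%M")
#> [1] "2019-03-23 20:00"

# Month and weekday names follow the session locale;
# in an English locale this gives "Mar 23, Sat"
format(when, "%b %d, %a")
```

The 20:00 local time is 14:30 UTC plus the +05:30 offset, which is why the time zone has to be set before formatting.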

CodePudding user response：

An RSelenium solution to get the dates:

library(RSelenium)
library(rvest)
library(dplyr)

url <- "https://www.cricbuzz.com/cricket-series/2810/indian-premier-league-2019/matches"

# Launch the browser and open the page
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
remDr$navigate(url)

# Parse the rendered page source and pull the text of the date elements
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(".schedule-date") %>%
  html_nodes(".ng-binding") %>%
  html_text()

[1] "Mar 23, Sat" "Mar 24, Sun" "Mar 25, Mon" "Mar 26, Tue" "Mar 27,