I am trying to write R code to get the dates in the date column of the webpage, e.g. "Mar 23, Sat". I viewed the page source and these dates are not present. So far I have tried the code below, but nothing works (please excuse me if it looks silly; I'm new to web scraping).
webpage <- read_html("https://www.cricbuzz.com/cricket-series/2810/indian-premier-league-2019/matches")
webpage %>% html_nodes(xpath = "//*[@id='series-matches']/div[4]/div[1]") %>% html_text()
#> [1] ""
webpage %>% html_nodes(xpath = "//html/body/div/div[2]/div[4]/div/div[6]/div[2]/span") %>% html_text()
#> [1] ""
webpage %>% html_nodes(xpath = "//html/body/div/div[2]/div[4]/div/div[6]/div[2]/span/ng-binding") %>% html_text()
#> character(0)
webpage %>% html_nodes(".ng-binding") %>% html_text()
#> character(0)
CodePudding user response:
The info is stored in the ng-bind attribute of child span elements whose direct parent has class "schedule-date".
<div ng-show="!filter_set"><span ng-bind=" 1553351400000| date:'MMM dd, EEE' : '+05:30'" >Mar 23, Sat</span></div>
You can use the CSS selector .schedule-date > span to target these elements and then extract their ng-bind attribute values. Each value gives you a Unix epoch timestamp in milliseconds, plus the UTC offset (Indian Standard Time) and the date formatting instructions.
1553351400000| date:'MMM dd, EEE' : '+05:30'
You can extract the timestamp portion and apply the relevant transformations (as informed by the info after the pipe delimiter).
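library(rvest)
library(tidyverse)
library(stringi)

# Read the ng-bind attribute of each date span
strings_with_dates <- read_html("https://www.cricbuzz.com/cricket-series/2810/indian-premier-league-2019/matches") %>%
  html_elements(".schedule-date > span") %>%
  html_attr("ng-bind")

# Extract the leading digits (milliseconds), convert to seconds,
# then format in Indian Standard Time
dates <- str_match(strings_with_dates, "(\\d+).*") %>%
  .[, 2] %>%
  as.numeric() %>%
  map(function(x) x / 1000) %>%
  unlist() %>%
  as.POSIXct(origin = "1970-01-01", tz = "Asia/Kolkata") %>%
  # Format the date-time directly in IST so the day cannot shift with the session time zone
  stri_datetime_format(format = "MMM dd, EEE", tz = "Asia/Kolkata")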
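As a quick check of the conversion on the single value shown above, here is a minimal base R sketch. It assumes the offset in the attribute reads '+05:30' and that "Asia/Kolkata" is the matching time zone. The variable names (ng_bind, ms, when, iso, label) are only illustrative.

```r
# One ng-bind value, as in the example above (offset assumed to read '+05:30')
ng_bind <- " 1553351400000| date:'MMM dd, EEE' : '+05:30'"

# Pull out the leading run of digits (milliseconds since the Unix epoch)
ms <- as.numeric(sub("^[[:space:]]*([0-9]+).*$", "\\1", ng_bind))

# Convert milliseconds to seconds and interpret the instant in Indian Standard Time
when <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "Asia/Kolkata")

# Numeric format codes do not depend on the session locale
iso <- format(when, "%Y-%m-%d %H:%M")
print(iso)  # "2019-03-23 20:00"

# Abbreviated month and weekday names follow the session locale
label <- format(when, "%b %d, %a")
print(label)
```

In an English locale the last line should match the "Mar 23, Sat" text displayed on the page. In other locales the month and weekday abbreviations will differ.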
CodePudding user response:
An RSelenium solution to get the dates:
library(RSelenium)
library(rvest)
library(dplyr)

url <- "https://www.cricbuzz.com/cricket-series/2810/indian-premier-league-2019/matches"

# Launch the browser and open the page
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
remDr$navigate(url)

# Parse the rendered page source and pull the text of the date elements
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(".schedule-date") %>%
  html_nodes(".ng-binding") %>%
  html_text()
[1] "Mar 23, Sat" "Mar 24, Sun" "Mar 25, Mon" "Mar 26, Tue" "Mar 27,
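Because the dates are filled in by JavaScript after the page loads, reading the page source immediately after navigate() might sometimes happen before they are rendered. One possible way to guard against that is a small polling helper. The function wait_until below is hypothetical (it is not part of RSelenium or rvest), and the timeout and interval defaults are arbitrary assumptions.

```r
# Hypothetical helper: call check() repeatedly until it returns a
# non-empty result, or stop with an error once the timeout has passed.
wait_until <- function(check, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- check()
    if (length(result) > 0) return(result)
    if (Sys.time() >= deadline) stop("Timed out waiting for rendered content")
    Sys.sleep(interval)
  }
}

# Intended use with the session above (commented out because it needs a live browser):
# dates <- wait_until(function() {
#   remDr$getPageSource()[[1]] %>%
#     read_html() %>%
#     html_nodes(".schedule-date") %>%
#     html_nodes(".ng-binding") %>%
#     html_text()
# })
```

The helper only checks that at least one element was found, so it is a sketch rather than a complete readiness test.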