Scrape data within a table from webpage using rvest in R

Time:01-27

I'm trying to get the data table from this webpage: http://rotoguru1.com/cgi-bin/fyday.pl?week=1&game=dk&scsv=1

This is the XPath of the data I want to extract: /html/body/table/tbody/tr/td[3]/pre.

I have tried:

url <- "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&game=dk&scsv=1"
DFS_table <- read_html(url) %>%
html_nodes(xpath = '/html/body/table/tbody/tr/td[3]/pre') %>%
html_table()
DFS_table<- DFS_table[[1]]

But I get this error: Error in DFS_table[[1]] : subscript out of bounds.

When I try:

library(rvest)
url <- "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&game=dk&scsv=1"
pg <- read_html(url)
tab <- html_table(pg, fill=TRUE)[[1]]

It seems to get all the data displayed on the web page, so I think my problem may be that the whole page is a table and I need to extract only part of it, but I am unsure how to do that.

Any help is appreciated.

CodePudding user response:

html_table() tries to read a properly formatted <table>[...]</table>, and it looks like your data of interest is preformatted text.

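library(rvest)
library(stringr)  # str_split() comes from stringr, not stringi

url <- "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&game=dk&scsv=1"

html <- read_html(url)
# Select the <pre> block inside the third table cell and extract its text
text <- html_text(html_nodes(html, xpath = './/td[3]/pre'))

# Split the text into one string per line
str_split(text, "\n")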

Which gives you:

  [1] "Week;Year;GID;Name;Pos;Team;h/a;Oppt;DK points;DK salary"   
  [2] "1;2021;1523;Mahomes II, Patrick;QB;kan;h;cle;36.28;8100"    
  [3] "1;2021;1537;Murray, Kyler;QB;ari;a;ten;34.56;7600"          
  [4] "1;2021;1490;Goff, Jared;QB;det;h;sfo;32.92;5100"            
  [5] "1;2021;1131;Brady, Tom;QB;tam;h;dal;32.16;6700"             
  [6] "1;2021;1501;Prescott, Dak;QB;dal;a;tam;31.42;6200" 
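A possible reason the XPath in the question matched nothing: browser developer tools often display a tbody element that is not present in the HTML source that read_html() parses. If that is the case here, dropping tbody from the path (or using a relative path, as above) would find the node. The sketch below is only an illustration under that assumption. The inline page in sample_html is made up and is not the real site's markup.

```r
library(rvest)

# Made-up stand-in for the page (an assumption, not the real site's markup):
# a layout table whose third cell holds a <pre> block, with no <tbody>.
sample_html <- paste0(
  "<html><body><table><tr>",
  "<td>left</td><td>middle</td>",
  "<td><pre>Week;Year;Name\n1;2021;Sample, Lee</pre></td>",
  "</tr></table></body></html>"
)
page <- read_html(sample_html)

# XPath as a browser might report it, including tbody
with_tbody <- html_nodes(page, xpath = "/html/body/table/tbody/tr/td[3]/pre")

# Same path with tbody removed
without_tbody <- html_nodes(page, xpath = "/html/body/table/tr/td[3]/pre")

# Compare how many nodes each path matched
length(with_tbody)
length(without_tbody)

# Text of the <pre> block found by the second path
html_text(without_tbody)
```

An empty node set passed to html_table() gives an empty list, which would explain the "subscript out of bounds" error when indexing with [[1]].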

CodePudding user response:

I assume that you want the data under the line "Semi-colon delimited format:". This is preformatted text, so trying to extract a table node won't work.

You can get that data straight into a data frame like this. Note the quote = "" argument to read.table, which disables quote handling. It is required because some player names contain single quotes.

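library(rvest)

url <- "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&game=dk&scsv=1"

# Take the first <pre> block, extract its text, and parse it as semicolon-delimited data
mydata <- read_html(url) %>% 
  html_node("pre") %>% 
  html_text() %>% 
  read.table(text = ., sep = ";", header = TRUE, quote = "")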

head(mydata)

  Week Year  GID                Name Pos Team h.a Oppt DK.points DK.salary
1    1 2021 1523 Mahomes II, Patrick  QB  kan   h  cle     36.28      8100
2    1 2021 1537       Murray, Kyler  QB  ari   a  ten     34.56      7600
3    1 2021 1490         Goff, Jared  QB  det   h  sfo     32.92      5100
4    1 2021 1131          Brady, Tom  QB  tam   h  dal     32.16      6700
5    1 2021 1501       Prescott, Dak  QB  dal   a  tam     31.42      6200
6    1 2021 1465     Winston, Jameis  QB  nor   h  gnb     29.62      5200
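To see the effect of quote = "" without hitting the website, here is a small offline sketch in base R. The rows in sample_text are made-up placeholders in the same semicolon-delimited layout, not real data from the page. With the default quote characters, an apostrophe in a name can be read as the start of a quoted string, which tends to merge fields or rows; quote = "" avoids that by treating apostrophes as ordinary text.

```r
# Made-up sample rows in the page's semicolon-delimited layout (not real data).
sample_text <- "Week;Year;GID;Name;Pos
1;2021;9001;O'Example, Pat;QB
1;2021;9002;Sample, Lee;RB
1;2021;9003;D'Illustration, Sam;WR"

# quote = "" disables quote handling, so apostrophes stay part of the name.
parsed <- read.table(text = sample_text, sep = ";", header = TRUE, quote = "")

# Inspect the result: one row per input line, names kept intact
nrow(parsed)
as.character(parsed$Name)
```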