How can I filter out numbers from an HTML table in R?

I am currently working on a forecasting model. For this, I would like to import data from an HTML page into R and save the numeric values from the data set in a new list.

I have used the following approach in R:

library(httr)
library(XML)

# getting website data:

link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
# remove HTML comment nodes before reading the tables
removeNodes(getNodeSet(document, "//*/comment()"))
doc.tables <- readHTMLTable(document)

# show BID/ASK block:

doc.tables[2]

In this case, doc.tables[2] gives me the following result:

$`NULL`
  Bid 0,765
1 Ask  0,80

How can I extract the numbers (0,765 and 0,80) from the table and save them in a list?

CodePudding user response：

The issue is that 0,765 is actually the name of a column in your data frame.

Your data frame is doc.tables[[2]].

You can get that name with names(doc.tables[[2]])[2].

Store it in a variable, for example name <- names(doc.tables[[2]])[2].

Then you can get the 0,80 with doc.tables[[2]][[2]]; store that in a variable as well if you like.

The final code should look like this: my_list <- list(name, doc.tables[[2]][[2]])
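
To try this without hitting the website, here is a minimal sketch in base R. The data frame bid_ask is a hypothetical stand-in for doc.tables[[2]], built by hand to mimic the printed result in the question; the real object may differ, for example in column types.

```r
# Hypothetical stand-in for doc.tables[[2]]: the first table row was
# consumed as the header, so "Bid" and "0,765" became column names.
bid_ask <- data.frame("Ask", "0,80", stringsAsFactors = FALSE)
names(bid_ask) <- c("Bid", "0,765")

bid <- names(bid_ask)[2]   # value hidden in the column name
ask <- bid_ask[[2]]        # value in the only data row

# Both values are still character strings at this point
my_list <- list(bid = bid, ask = ask)
str(my_list)
```

Naming the list elements is optional, but it makes it harder to mix up the two quotes later.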

CodePudding user response：

This is building on Jahi Zamy’s observation that some of your data are showing up as column names and on the example code in the question.

library(httr)
library(XML)

# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))

# readHTMLTable() assumes tables have a header row by default,
# but these tables do not, so use header=FALSE
doc.tables <- readHTMLTable(document, header=FALSE)

# Extract the value column from the BID/ASK table
BidAsk = doc.tables[[2]][,2]
# Replace the decimal comma with a decimal point and convert to numeric
BidAsk = as.numeric(gsub(",", ".", BidAsk))
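
The conversion step can be checked offline. The sketch below wraps it in a helper, to_number, which is a hypothetical name and not part of any package. The input strings are copied from the question, and it assumes the values contain no thousands separators.

```r
# Hypothetical helper: convert strings with a decimal comma to numeric.
# as.character() guards against the column having been read as a factor.
to_number <- function(x) {
  as.numeric(gsub(",", ".", as.character(x), fixed = TRUE))
}

# Values as they appear in the question's BID/ASK table
bid_ask <- to_number(c("0,765", "0,80"))

# Store the result as a named list, as the question asks for a list
quotes <- as.list(setNames(bid_ask, c("Bid", "Ask")))
str(quotes)
```

Using fixed = TRUE treats the comma as a literal character rather than a regular expression, which is all this replacement needs.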

CodePudding user response：

Here is a way using rvest instead of the XML package.
The code below uses two more packages, stringr and readr, to extract the values and their names.

library(httr)
library(rvest)
library(dplyr)

link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"

page <- read_html(link)

tbl <- page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  stringr::str_replace_all(",", ".")

tibble(name = stringr::str_extract(tbl, "Ask|Bid"), 
       value = readr::parse_number(tbl))
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 Bid   0.765
#> 2 Ask   0.8

Created on 2022-03-26 by the reprex package (v2.0.1)

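To avoid saving the pipe result in the temporary object tbl, the pipe can continue as follows. Because the piped character vector also becomes the first column of the tibble, the final .[-1] drops that column.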

library(httr)
library(rvest)
library(stringr)
suppressPackageStartupMessages(library(dplyr))

link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"

page <- read_html(link)

page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  str_replace_all(",", ".") %>%
  tibble(name = str_extract(., "Ask|Bid"), 
         value = readr::parse_number(.)) %>%
  .[-1]
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 Bid   0.765
#> 2 Ask   0.8

Created on 2022-03-27 by the reprex package (v2.0.1)
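
As a possible variation, instead of replacing the comma first, readr::parse_number() can be given a locale whose decimal mark is a comma. This is only a sketch: the strings in rows are hypothetical stand-ins for the html_text() output of the two table rows, and the text of the real page may contain extra whitespace or other characters.

```r
library(readr)

# Hypothetical stand-ins for the text of the Bid and Ask rows
rows <- c("Bid0,765", "Ask0,80")

# Locale that treats "," as the decimal mark
de <- locale(decimal_mark = ",")

quotes <- data.frame(
  name  = regmatches(rows, regexpr("Ask|Bid", rows)),  # row labels
  value = parse_number(rows, locale = de)              # numeric values
)
quotes
```

Here base R's regmatches() stands in for stringr::str_extract() to keep the sketch down to one package; either would do for labels this simple.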