How can I filter out numbers from an html table in R?

Time:03-27

I am working on a forecasting model. For this I would like to import data from an HTML page into R and save the values in the data set to a new list.

I have used the following approach in R:

# packages used below:

library(httr)
library(XML)

# getting website data:

link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
# drop HTML comment nodes before reading the tables
removeNodes(getNodeSet(document, "//*/comment()"))
doc.tables <- readHTMLTable(document)

# show BID/ASK block:

doc.tables[2]

In this case, doc.tables[2] gives me the result:

$`NULL`
  Bid 0,765
1 Ask  0,80

How can I extract the numbers (0,765 and 0,80) from the table and save them in a list?

CodePudding user response:

The issue is that 0,765 is actually the name of the second column of your data frame, not a value in it.

Your data frame is doc.tables[[2]].

You can get that name by calling names(doc.tables[[2]])[2]

Store it in a variable: name <- names(doc.tables[[2]])[2]

You can then get the 0,80 with doc.tables[[2]][[2]]; store that in a variable too if you like.

The final code should look like: my_list <- list(name, doc.tables[[2]][[2]])
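
The same idea can be tried without hitting the website. This is only a minimal sketch: bid_ask is a hand-built, hypothetical stand-in for doc.tables[[2]] with the same shape as the output in the question, and to_number is an illustrative helper, not part of any package.

```r
# Hypothetical stand-in for doc.tables[[2]]: the Bid row was parsed as the
# header, so "0,765" is a column name and only the Ask row is left as data.
bid_ask <- data.frame(V1 = "Ask", V2 = "0,80", stringsAsFactors = FALSE)
names(bid_ask) <- c("Bid", "0,765")

bid_raw <- names(bid_ask)[2]   # the bid is hiding in the column name
ask_raw <- bid_ask[[2]]        # the ask is the only value in column 2

# Illustrative helper: turn a decimal comma into a decimal point
to_number <- function(x) as.numeric(sub(",", ".", x, fixed = TRUE))

# Named list holding both quotes as numbers
quotes <- list(Bid = to_number(bid_raw), Ask = to_number(ask_raw))
str(quotes)
```

Naming the list elements keeps track of which number is the bid and which is the ask, which an unnamed list(name, value) would not.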

CodePudding user response:

This is building on Jahi Zamy’s observation that some of your data are showing up as column names and on the example code in the question.

library(httr)
library(XML)

# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))

# readHTMLTable() assumes tables have a header row by default,
# but these tables do not, so use header=FALSE
doc.tables <- readHTMLTable(document, header=FALSE)

# Extract the second column (the values) from the BID/ASK table
BidAsk <- doc.tables[[2]][, 2]
# Replace the decimal comma with a decimal point and convert to numeric
BidAsk <- as.numeric(gsub(",", ".", BidAsk))
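
One caveat with the gsub() step: it assumes the values never contain a thousands separator. Below is a sketch of a slightly more defensive conversion in base R, assuming German-style formatting (point for thousands, comma for decimals). The helper name parse_de_number and the sample values are made up for illustration.

```r
# Hypothetical helper: convert German-formatted number strings to numeric.
# Assumes "." is only ever a thousands separator and "," the decimal mark.
parse_de_number <- function(x) {
  x <- as.character(x)                 # also handles factor columns
  x <- gsub(".", "", x, fixed = TRUE)  # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE)  # decimal comma -> decimal point
  as.numeric(x)
}

# Made-up sample values, including one with a thousands separator
parsed <- parse_de_number(c("0,765", "0,80", "1.234,56"))
print(parsed)
```

If the page ever mixed in values that already use a decimal point, this helper would misread them, so it is only appropriate when the formatting is known.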

CodePudding user response:

Here is a way with rvest instead of the XML package.
The code below uses two more packages, stringr and readr, to extract the values and their names.

library(httr)
library(rvest)
library(dplyr)

link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"

page <- read_html(link)

tbl <- page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  stringr::str_replace_all(",", ".")

tibble(name = stringr::str_extract(tbl, "Ask|Bid"), 
       value = readr::parse_number(tbl))
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 Bid   0.765
#> 2 Ask   0.8

Created on 2022-03-26 by the reprex package (v2.0.1)


Without saving the pipe result to a temporary object, tbl, the pipe can continue as below.

library(httr)
library(rvest)
library(stringr)
suppressPackageStartupMessages(library(dplyr))

link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"

page <- read_html(link)

page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  str_replace_all(",", ".") %>%
  tibble(name = str_extract(., "Ask|Bid"), 
         value = readr::parse_number(.)) %>%
  .[-1]
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 Bid   0.765
#> 2 Ask   0.8

Created on 2022-03-27 by the reprex package (v2.0.1)
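
As an alternative to replacing the commas by hand, readr can be told that the comma is the decimal mark. This is a minimal sketch, assuming row text of the form shown in the output above; the row_text vector is made up for illustration rather than scraped from the page.

```r
library(readr)

# Made-up stand-in for the text of the two table rows
row_text <- c("Bid 0,765", "Ask 0,80")

# Locale in which the comma is the decimal mark
de <- locale(decimal_mark = ",")

# parse_number() skips the leading label and reads the first number it finds
values <- parse_number(row_text, locale = de)

# Label each value with the "Bid"/"Ask" tag found in its row
names(values) <- regmatches(row_text, regexpr("Ask|Bid", row_text))
values
```

With a locale in place, the str_replace_all() step in the pipelines above should no longer be needed, though that has not been checked against the live page.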
