Web Scraping from B3 iFrame

Time:09-25

I am trying to download some data from the BM&FBOVESPA reference rates page. The page embeds an iframe with the following URL:

https://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-ptBR.asp?Data=23/09/2021&Mercadoria=DI1

Here is my code, which is returning a DF with only NA:

library(rvest)
library(stringr)

html_url <- "https://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-ptBR.asp?Data=23/09/2021&Mercadoria=DI1"

html <- read_html(html_url)

data <- html %>%
  html_nodes("td") %>%
  html_text() %>%
  str_replace(",", ".") %>%
  as.numeric()

How can I download the data from this page?

CodePudding user response:

The tables are generated dynamically by JavaScript running in the browser, which is why selecting td nodes from the static HTML does not return the data. The page source still contains the table HTML, in quoted strings on lines that mention MercFut1, MercFut2 and MercFut3. Reviewing the source gives enough of an idea to write a regex that extracts only those strings.

For each table, concatenate the regex matches into a single string, parse it with an HTML parser, then select the table. The two data tables appear in the page source in the reverse of their on-page order, so read them as 2 then 1 and column-bind them into a single data frame. The first column has to be pulled in separately, without relying on html_table(), and added to the front of the data frame. (A possible refactor would merge this step into the existing function by using CSS selectors.)

library(rvest)
library(dplyr)
library(stringr)

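# Extract the HTML fragments for MercFut<number> from the page text `r`,
# join them into one string, and parse the resulting table.
get_table <- function(number, r) {
  pat <- sprintf("MercFut%i.*'(.*)?'", number)
  fragments <- stringr::str_match_all(r, pat)[[1]][, 2]
  parsed <- read_html(paste0(fragments, collapse = "")) |>
    html_table()
  tibble(parsed[[1]])
}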

r <- read_html("https://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-ptBR.asp?Data=23/09/2021&Mercadoria=DI1") |>
  html_text()

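# Fetch table 2 before table 1 to match the order shown on the page.
tables <- lapply(c(2, 1), get_table, r)

# Bind the two tables side by side.
df <- cbind(tables[[1]], tables[[2]])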

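# The first column comes from MercFut3; read its header and cells as plain text.
first_col <- stringr::str_match_all(r, sprintf("MercFut%i.*'(.*)?'", 3))[[1]][, 2] |>
  paste0(collapse = "") |>
  read_html() |>
  html_elements("th, td") |>
  html_text()

# Name the new column after the header cell, fill it with the remaining cells,
# and move it to the front.
df <- tibble::add_column(df, !!(first_col[1]) := first_col[-1]) |>
  select(all_of(first_col[1]), everything())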
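
The refactor mentioned above could look roughly like the sketch below. It is only an illustration: parse_fragment() is a hypothetical helper name, and the script text is a made-up stand-in for the real page source so that the example runs offline. The helper returns the parsed HTML, and the caller then decides whether to read it with html_table() or with a CSS selector.

```r
library(rvest)
library(stringr)

# Hypothetical helper: collect the HTML fragments on lines mentioning
# MercFut<number> in the script text `r` and parse them as one document.
parse_fragment <- function(number, r) {
  pat <- sprintf("MercFut%i.*'(.*)?'", number)
  fragments <- str_match_all(r, pat)[[1]][, 2]
  read_html(paste0(fragments, collapse = ""))
}

# Toy stand-in for the page's script text (invented for illustration only).
r <- paste(
  "MercFut1 = MercFut1 + '<table><tr><th>Price</th></tr>';",
  "MercFut1 = MercFut1 + '<tr><td>4,91</td></tr></table>';",
  "MercFut3 = MercFut3 + '<table><tr><th>Maturity</th></tr>';",
  "MercFut3 = MercFut3 + '<tr><td>V21</td></tr></table>';",
  sep = "\n"
)

# A data table goes through html_table(); the label column is read cell by cell.
prices <- html_table(parse_fragment(1, r))[[1]]
labels <- parse_fragment(3, r) |>
  html_elements("th, td") |>
  html_text()
```

With the real page, r would be the html_text() of the downloaded document, and the same helper would serve all three MercFut fragments.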

The native pipe |> requires R 4.1.0 or later. On older versions, replace |> with %>%, which is available once rvest or dplyr is loaded.
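
The question's code tried to convert comma decimals with str_replace() followed by as.numeric(). str_replace() changes only the first match in each string, and as.numeric() returns NA for any text it cannot parse. If the scraped columns come back as text in Brazilian number format (an assumption about the data, not something verified here), a small base R helper along these lines may help. The name to_numeric() is hypothetical.

```r
# Hypothetical helper: convert text such as "1.234,56" to the number 1234.56.
to_numeric <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)   # drop thousands separators
  x <- gsub(",", ".", x, fixed = TRUE)  # switch the decimal comma to a point
  suppressWarnings(as.numeric(x))       # text that is not a number becomes NA
}

to_numeric(c("4,91", "1.234,56", "-"))  # 4.91, 1234.56, NA
```

It should be applied only to the columns that hold numbers, since label columns would otherwise turn into NA.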
