Web Scraping from B3 iFrame

I am trying to download some data from the BM&FBOVESPA reference rates page. The page contains an iframe that loads the following URL:

https://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-ptBR.asp?Data=23/09/2021&Mercadoria=DI1

Here is my code, which returns nothing but NA values:

library(rvest)
library(stringr)

html_url <- "https://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-ptBR.asp?Data=23/09/2021&Mercadoria=DI1"

html <- read_html(html_url)

data <- html %>%
  html_nodes("td") %>%
  html_text() %>%
  str_replace(",", ".") %>%
  as.numeric()

How can I download the data from this page?

CodePudding user response:

The tables are generated dynamically by JavaScript running in the browser, so rvest, which does not execute JavaScript, cannot select them directly. Reviewing the page source gives enough of an idea to write a regex that extracts only the strings defining the HTML of the tables of interest. For each table, concatenate the regex matches into a single string, parse that string with an HTML parser, and select the table. The two tables appear in the page source in the reverse of the order shown on the page, so they need to be swapped and then column-bound into a single data frame. The first column has to be pulled in separately, without relying on html_table(), and added to the front of the data frame. A possible refactor would merge this last step into the existing function by using CSS selectors.
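The extract, concatenate, and parse steps described above can be tried offline first. The sketch below is only an illustration: the string js is a made-up stand-in for the page's script text, and its exact layout (one quoted HTML fragment per line, appended to a variable named MercFut1) is an assumption, not a copy of the real page source.

```r
library(rvest)
library(stringr)

# Hypothetical script text: each line appends one quoted HTML fragment.
js <- paste(
  "MercFut1 = MercFut1 + '<table><tr><th>Open</th><th>Close</th></tr>';",
  "MercFut1 = MercFut1 + '<tr><td>1</td><td>2</td></tr></table>';",
  sep = "\n"
)

# Same pattern shape as in the answer: capture what sits between the quotes.
pat <- "MercFut1.*'(.*)?'"
fragments <- str_match_all(js, pat)[[1]][, 2]

# Join the fragments into one HTML string, parse it, and read the table.
html <- paste0(fragments, collapse = "")
tbl <- read_html(html) |>
  html_element("table") |>
  html_table()
```

Because . does not match a newline by default, each match stays within one line, which is why the fragments come back one per line and have to be pasted together. The pattern is greedy, so it would misbehave if a fragment itself contained single quotes; the sample avoids that. The full solution against the live page follows.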

library(rvest)
library(dplyr)
library(stringr)

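get_table <- function(number, r) {
  # Capture the quoted HTML fragment in each match for MercFut<number>
  pat <- sprintf("MercFut%i.*'(.*)?'", number)
  # Join the fragments into one HTML string, parse it, and extract its tables
  t <- read_html(paste0(stringr::str_match_all(r, pat)[[1]][, 2], collapse = "")) |>
    html_table()
  # html_table() returns a list of tables; keep the first one
  return(tibble(t[[1]]))
}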

r <- read_html("https://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-ptBR.asp?Data=23/09/2021&Mercadoria=DI1") |>
  html_text()

tables <- lapply(c(2, 1), get_table, r)

df <- cbind(tables[[1]], tables[[2]])

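# Build the first column from the MercFut3 fragments, reading the cells directly
first_col <- stringr::str_match_all(r, sprintf("MercFut%i.*'(.*)?'", 3))[[1]][, 2] |>
  paste0(collapse = "") |>
  read_html() |>
  html_elements("th, td") |>
  html_text()

# The first cell is the header; the remaining cells are the values
df <- tibble::add_column(df, !!(first_col[1]) := first_col[-1]) |>
  select(all_of(first_col[1]), everything())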

The native pipe |> requires R 4.1 or later. For older R versions, replace |> with %>%.
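The question's code also tried to turn the cell text into numbers. If the scraped values use Brazilian formatting (a dot as thousands separator and a comma as decimal mark, which is an assumption about the data), note that str_replace() changes only the first match in each string and leaves any thousands separators in place. A minimal sketch of one way to handle both, using a hypothetical helper named parse_br_number:

```r
library(stringr)

# Hypothetical helper: convert strings such as "1.234,56" to numeric.
# Assumes "." is only ever a thousands separator and "," the decimal mark.
parse_br_number <- function(x) {
  x |>
    str_replace_all(fixed("."), "") |>  # drop thousands separators
    str_replace(fixed(","), ".") |>     # decimal comma -> decimal point
    as.numeric()
}

values <- parse_br_number(c("1.234,56", "10,5", "7"))
```

Apply it only to the columns that actually hold numbers; text columns would be turned into NA with a warning.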