Home > Back-end >  Rvest: xlsx download
Rvest: xlsx download

Time:08-05

I'm trying to download an xlsx file with the code below:

library(rvest)

file <- "tesouro.csv"

site <- read_html("https://www.tesourotransparente.gov.br/publicacoes/boletim-resultado-do-tesouro-nacional-rtn/")

link <- site %>% html_nodes(xpath="//a[contains(text(), 'serie_historica_jun22.xlsx')]") %>% html_attr("href")

download.file(
  url = link,
  mode = "w", destfile = file
)

But the download is an empty xlsx with the html code inside the spreadsheet:

    <html>
    <head>
        <script>
        var isMobile = /iPhone|iPad|iPod|Android/i.test(navigator.userAgent)
        if(isMobile) {
            window.location = "https://cdn.tesouro.gov.br/sistemas-internos///apex//producao//sistemas//thot//arquivos//publicacoes/44179_1398982/anexos/16848_586003///serie_historica_jun22.xlsx?v=7366"
        }
    </script>
    </head>
    <frameset COLS="*" border=0>
        <frame SRC="https://cdn.tesouro.gov.br/sistemas-internos///apex//producao//sistemas//thot//arquivos//publicacoes/44179_1398982/anexos/16848_586003///serie_historica_jun22.xlsx?v=7366" frameborder=0>
    </frameset>
    </html>```

CodePudding user response:

You need to follow the link in the frame. Also, it's an xlsx, so you need to save it as one, not as a csv:

library(rvest)

site <- read_html(paste0("https://www.tesourotransparente.gov.br/publicacoes/",
                         "boletim-resultado-do-tesouro-nacional-rtn/"))

site %>% 
  html_nodes(xpath="//a[contains(text(), 'serie_historica_jun22.xlsx')]") %>% 
  html_attr("href") %>%
  read_html() %>%
  html_nodes(xpath = "//frameset/frame") %>%
  html_attr("src") %>%
  httr::GET() %>%
  httr::content("raw") %>%
  writeBin("tesouro.xlsx")

Now we have

tesouro.xlsx

enter image description here

  • Related