Downloading a dynamic file from html node with R

Time:12-06

So, I have the following script:

library(rvest)
library(xml2)

DOES <- session("https://ioes.dio.es.gov.br/portal/visualizacoes/diario_oficial")
DOES <- read_html(DOES)
x1b6 <- xml_find_all(DOES, "//a[@id='baixar-diario-completo']")
x1b6
{xml_nodeset (1)}
[1] <a href="/portal/edicoes/download/0" id="baixar-diario-completo">\n                        <img src=""  ...

It's the official journal of my local government. I'm trying to download the file linked at the XPath html//body//div[2]//div[1]//div[1]//div[1]//div[1]//div[1]//a

The file changes every day with a new journal edition, so I'm trying to create an extraction routine that downloads it automatically each day. When I inspect the element in Chrome, it shows the right daily href: https://ioes.dio.es.gov.br/portal/edicoes/download/7620. But in the output above, as you can see, the href ends with 0. How can I get the right path?

CodePudding user response:

I propose this solution. Supply the function with a date, and the PDF is downloaded to your working directory automatically.

library(tidyverse)
library(httr2)

get_file <- function(date) {
  str_c("https://ioes.dio.es.gov.br/apifront/portal/edicoes/edicoes_from_data/", date, 
        ".json?&subtheme=false") %>%
    request() %>%
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE) %>%
    getElement("itens") %>%
    pull(id) %>% 
    str_c("https://ioes.dio.es.gov.br/portal/edicoes/download/", .) %>% 
    download.file(., mode = "wb", 
                  destfile = str_c(date, ".pdf"))
}

get_file("2022-11-30")
get_file(lubridate::today())

CodePudding user response:

In the Network tab of Chrome's developer tools, I can see that the site requests the edition from "https://ioes.dio.es.gov.br/apifront/portal/edicoes/edicoes_from_data.json". So you can obtain the id the following way:

resp_id <- httr::GET("https://ioes.dio.es.gov.br/apifront/portal/edicoes/edicoes_from_data.json")
id <- httr::content(resp_id)$itens[[1]]$id
id
#> [1] 7623

Then append it to the download URL (https://ioes.dio.es.gov.br/portal/edicoes/download/) to get the file.
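
As a minimal sketch of that last step, assuming the id is simply appended to the download path shown in the question: the helper name build_download_url and the destination file name are hypothetical, and the file is assumed to be a PDF, as the other answer suggests.

```r
# Hypothetical helper: builds the download URL for a given edition id.
# The base path is the one quoted in the question.
build_download_url <- function(id) {
  paste0("https://ioes.dio.es.gov.br/portal/edicoes/download/", id)
}

url <- build_download_url(7623)  # 7623 is the id printed above
url

# Uncomment to actually fetch the file (needs network access).
# mode = "wb" writes the file in binary mode, which matters on Windows.
# download.file(url, destfile = "diario_oficial.pdf", mode = "wb")
```

The download call is left commented out so the snippet runs offline; in a daily routine you would replace the literal 7623 with the id returned by the request above.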
