So, I have the following script:
library(rvest)
library(xml2)
DOES <- session("https://ioes.dio.es.gov.br/portal/visualizacoes/diario_oficial")
DOES <- read_html(DOES)
x1b6 <- xml_find_all(DOES, "//a[@id='baixar-diario-completo']")
x1b6
{xml_nodeset (1)}
[1] <a href="/portal/edicoes/download/0" id="baixar-diario-completo">\n <img src="" ...
It's the official journal of my local government. I'm trying to download the file linked from the element at this XPath: html//body//div[2]//div[1]//div[1]//div[1]//div[1]//div[1]//a
The file changes every day with a new journal edition, so I'm trying to create an extraction routine that downloads it automatically each day. When I inspect the element in Chrome, it shows the correct daily href (https://ioes.dio.es.gov.br/portal/edicoes/download/7620). But in the output above, as you can see, the href ends with 0. How can I get the right path?
CodePudding user response:
I propose this solution. Simply pass a date to the function and the PDF is downloaded to your working directory automatically.
library(tidyverse)
library(httr2)
get_file <- function(date) {
  # Ask the site's API for the edition published on `date`
  str_c("https://ioes.dio.es.gov.br/apifront/portal/edicoes/edicoes_from_data/", date,
        ".json?&subtheme=false") %>%
    request() %>%
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE) %>%
    getElement("itens") %>%
    pull(id) %>%
    # Build the download URL from the edition id and save the PDF
    str_c("https://ioes.dio.es.gov.br/portal/edicoes/download/", .) %>%
    download.file(., mode = "wb",
                  destfile = str_c(date, ".pdf"))
}
get_file("2022-11-30")
get_file(lubridate::today())
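If you run this every day, it may help to guard against dates with no edition or with more than one. The following is only a sketch under assumptions: the helper names (edition_api_url, edition_ids, download_editions) are made up for illustration, and I am assuming the JSON has an "itens" element with an "id" column, as in the function above. I have not checked how the API responds for a date without an edition.

```r
# Build the per-date API URL (pure helper, no network access).
edition_api_url <- function(date) {
  date <- format(as.Date(date), "%Y-%m-%d")  # accepts "2022-11-30" or a Date
  paste0("https://ioes.dio.es.gov.br/apifront/portal/edicoes/edicoes_from_data/",
         date, ".json?&subtheme=false")
}

# Return the edition ids for a date, or an empty vector when none are listed.
# Assumption: the response has an `itens` element with an `id` column.
edition_ids <- function(date) {
  resp  <- httr2::req_perform(httr2::request(edition_api_url(date)))
  body  <- httr2::resp_body_json(resp, simplifyVector = TRUE)
  items <- body[["itens"]]
  if (is.null(items) || length(items) == 0) return(integer(0))
  items[["id"]]
}

# Download every edition found for a date; returns the file paths written.
download_editions <- function(date) {
  ids <- edition_ids(date)
  if (length(ids) == 0) {
    message("No edition listed for ", date)
    return(invisible(character(0)))
  }
  urls  <- paste0("https://ioes.dio.es.gov.br/portal/edicoes/download/", ids)
  files <- paste0(format(as.Date(date), "%Y-%m-%d"), "_", ids, ".pdf")
  for (i in seq_along(urls)) {
    download.file(urls[i], destfile = files[i], mode = "wb")
  }
  invisible(files)
}

# Usage (performs network requests):
# download_editions(Sys.Date())
```

The file name includes the id so that two editions on the same date do not overwrite each other.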
CodePudding user response:
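In the Network tab of Chrome's developer tools, I can see that the site requests the edition from "https://ioes.dio.es.gov.br/apifront/portal/edicoes/edicoes_from_data.json". The href is presumably filled in by JavaScript after that request, which would explain why the static HTML read by rvest only shows 0. So you can obtain the id the following way: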
resp_id <- httr::GET("https://ioes.dio.es.gov.br/apifront/portal/edicoes/edicoes_from_data.json")
id <- httr::content(resp_id)$itens[[1]]$id
id
#> [1] 7623
Then append it to the download URL, https://ioes.dio.es.gov.br/portal/edicoes/download/, to fetch the file.
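Putting the two steps together could look roughly like this. It is a minimal sketch, not tested against the live site: build_download_url and download_latest are names I made up, the default file name is arbitrary, and I am assuming the first element of "itens" is the edition you want and that the download is a PDF.

```r
# Build the download URL for a given edition id (pure helper, no network access).
build_download_url <- function(id) {
  paste0("https://ioes.dio.es.gov.br/portal/edicoes/download/", id)
}

# Look up the current edition id and save the file to disk.
# Assumption: the first entry of `itens` is the edition to download.
download_latest <- function(destfile = NULL) {
  resp_id <- httr::GET("https://ioes.dio.es.gov.br/apifront/portal/edicoes/edicoes_from_data.json")
  httr::stop_for_status(resp_id)
  id <- httr::content(resp_id)$itens[[1]]$id
  if (is.null(destfile)) destfile <- paste0("diario_", id, ".pdf")
  # write_disk() streams the response body straight to `destfile`
  resp_file <- httr::GET(build_download_url(id),
                         httr::write_disk(destfile, overwrite = TRUE))
  httr::stop_for_status(resp_file)
  invisible(destfile)
}

# Usage (performs network requests):
# download_latest()
```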