I'm trying to automate downloading of the data contained here:
I can fairly easily specify the form options, either through the URL, like this: https://www.offenerhaushalt.at/gemeinde/innsbruck/download?year=2022&haushalt=fhh&rechnungsabschluss=va&origin=gemeinde
or through the rvest function html_form(), but I cannot download the file, because html_form_submit() throws the error:
Error in `submission_build()`:
! `form` doesn't contain a `action` attribute
library(rvest)
library(tidyverse)

html_form(read_html("https://www.offenerhaushalt.at/gemeinde/innsbruck/download"))[[1]] %>%
  html_form_set(year = "2022",
                haushalt = "fhh",
                rechnungsabschluss = "va",
                origin = "gemeinde") %>%
  html_form_submit()
Any ideas on how to capture the file that is generated afterwards and download it?
It seems to me that the form is submitted to an "action" URL that looks like: https://www.offenerhaushalt.at/downloads/ghdByParams
But I'm not sure what to do with that.
Thanks all!
CodePudding user response:
You can manually set the action url for that form:
library(rvest)
library(purrr)
dl_url <- "https://www.offenerhaushalt.at/gemeinde/innsbruck/download"
sess <- session(dl_url)
form <- sess %>% read_html() %>% html_form() %>% .[[1]]
# list valid options for select boxes
map(form$fields, "options") %>% keep(~ length(.x) > 0) %>%
  imap_dfr(~ list(field = .y, options = paste(.x, collapse = " ")))
#> # A tibble: 4 × 2
#>   field              options
#>   <chr>              <chr>
#> 1 haushalt           default fhh ehh vhh
#> 2 rechnungsabschluss default ra va
#> 3 year               default 2022 2021 2020 2019 2018 2017 2016 2015 2014 2013 …
#> 4 origin             default statistik_at gemeinde
# set values
form$fields$haushalt$value <- "fhh"
form$fields$rechnungsabschluss$value <- "ra"
form$fields$year$value <- "2020"
form$fields$origin$value <- "statistik_at"
# manually set form method & action
form$method <- "POST"
form$action <- "https://www.offenerhaushalt.at/downloads/ghdByParams"
# submit form
sess <- session_submit(sess, form)
# response headers
imap_dfr(sess$response$headers, ~ list(header = .y, value = .x))
#> # A tibble: 10 × 2
#>    header              value
#>    <chr>               <chr>
#>  1 date                Sat, 21 Jan 2023 01:47:13 GMT
#>  2 server              Apache
#>  3 content-disposition attachment; filename=offenerhaushalt_70101_2020_ra_fhh.c…
#>  4 pragma              no-cache
#>  5 cache-control       must-revalidate, post-check=0, pre-check=0, private
#>  6 expires             0
#>  7 set-cookie          XSRF-TOKEN=eyJpdiI6IjdHd2pSakwzV09xb3Jab05zXC81em1RPT0iL…
#>  8 set-cookie          offener_haushalt_session=eyJpdiI6IjI5cUN5MGhCSmVadmN5enV…
#>  9 transfer-encoding   chunked
#> 10 content-type        text/csv; charset=UTF-8
# parse attached CSV
httr::content(sess$response, as = "text") %>% readr::read_csv2()
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1408 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (8): ansatz_uab, ansatz_ugl, konto_grp, konto_ugl, sonst_ugl, vorhabenco...
#> dbl (2): mvag, wert
#> lgl (1): verguetung
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1,408 × 11
#> ansat…¹ ansat…² konto…³ konto…⁴ sonst…⁵ vergu…⁶ vorha…⁷ mvag ansat…⁸ konto…⁹
#> <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <dbl> <chr> <chr>
#> 1 000 000 042 000 000 NA 0000000 3415 Gewähl… Amts-,…
#> 2 000 000 070 000 000 NA 0000000 3411 Gewähl… Aktivi…
#> 3 000 000 400 000 000 NA 0000000 3221 Gewähl… Gering…
#> 4 000 000 413 000 000 NA 0000000 3221 Gewähl… Handel…
#> 5 000 000 456 000 000 NA 0000000 3221 Gewähl… Schrei…
#> 6 000 000 457 000 000 NA 0000000 3221 Gewähl… Druckw…
#> 7 000 000 459 000 000 NA 0000000 3221 Gewähl… Sonsti…
#> 8 000 000 618 000 000 NA 0000000 3224 Gewähl… Instan…
#> 9 000 000 621 000 000 NA 0000000 3222 Gewähl… Sonsti…
#> 10 000 000 631 000 000 NA 0000000 3222 Gewähl… Teleko…
#> # … with 1,398 more rows, 1 more variable: wert <dbl>, and abbreviated variable
#> # names ¹ansatz_uab, ²ansatz_ugl, ³konto_grp, ⁴konto_ugl, ⁵sonst_ugl,
#> # ⁶verguetung, ⁷vorhabencode, ⁸ansatz_text, ⁹konto_text
As rvest accepts and passes on httr configs, attached files can be saved directly too:
dest_file <- tempfile(fileext = ".csv")
session_submit(sess, form, submit = NULL, httr::write_disk(dest_file))
# browseURL(dirname(dest_file))
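Since the goal is to automate the download, the steps above could be wrapped in a small function. This is only a sketch: download_haushalt() is a hypothetical helper name, not part of rvest, and it assumes the site keeps serving the same form, field names, and action URL used above.

```r
library(rvest)

# Hypothetical helper (not an rvest function): repeats the steps shown above
# for one set of form values and saves the returned CSV to dest_file.
# Assumes the form layout and the action URL are unchanged.
download_haushalt <- function(year,
                              haushalt = "fhh",
                              rechnungsabschluss = "ra",
                              origin = "statistik_at",
                              dest_file = tempfile(fileext = ".csv")) {
  dl_url <- "https://www.offenerhaushalt.at/gemeinde/innsbruck/download"
  sess <- session(dl_url)
  form <- html_form(read_html(sess))[[1]]

  # set values (stored as character strings, as in the example above)
  form$fields$year$value <- as.character(year)
  form$fields$haushalt$value <- haushalt
  form$fields$rechnungsabschluss$value <- rechnungsabschluss
  form$fields$origin$value <- origin

  # manually set form method & action, as above
  form$method <- "POST"
  form$action <- "https://www.offenerhaushalt.at/downloads/ghdByParams"

  # submit and write the response body straight to dest_file
  session_submit(sess, form, submit = NULL,
                 httr::write_disk(dest_file, overwrite = TRUE))
  dest_file
}

# Usage (commented out because it hits the live site):
# files <- vapply(2018:2020, function(y) {
#   Sys.sleep(1)  # pause between requests to go easy on the server
#   download_haushalt(y)
# }, character(1))
```

Each call opens a fresh session, so the hidden fields and cookies that come with the form are re-read every time rather than reused across requests.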