I would like to extract the links listed under "Information" on a webpage using R.
The data is publicly available and scraping is not prohibited.
I tried using httr::POST, but I still do not get the page content/table; I only get "Loading...".
#library
library(httr)
library(jsonlite)
library(rvest)
# set parameter
body <- list(
queryTerm="Vk_20220224_16",
fromDate="",
toDate="")
# POST
res <- POST(
"https://fsca.swissmedic.ch/",
body = jsonlite::toJSON(body),
encode = "form",
verbose()
)
# get results
read_html(res)
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\n<mep-app>Loading...</mep-app><script type="text/javascript" src=" ...
Created on 2022-12-23 with reprex v2.0.2
CodePudding user response:
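The page is rendered by JavaScript, so the HTML you get back only contains the "Loading..." placeholder. The data itself comes from a JSON API, which you can request directly with httr2: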
library(httr2)
library(tidyverse)
"https://fsca.swissmedic.ch/mep/api/publications/search?pageNumber=0&sortingProperty=PUBLICATION_DATE&direction=DESC" %>%
request() %>%
req_body_json(
list(
fromDate = "2022-12-04",
toDate = "2022-12-20",
queryTerm = NULL,
onlyUpdates = "false"
)
) %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
pluck("content") %>%
as_tibble()
# A tibble: 37 × 9
publikationsDatum swissmedicRef hersteller status status…¹ begru…² devices freig…³ docum…⁴
<chr> <chr> <chr> <chr> <chr> <chr> <list> <lgl> <list>
1 2022-12-07 Vk_20221202_03 Medtronic CoreValve LLC UPDATE 2022-12… "Added… <df> TRUE <df>
2 2022-12-20 Vk_20221216_12 Biocartis NV UPDATE 2022-12… "Added… <df> TRUE <df>
3 2022-12-20 Vk_20221219_01 Siemens Healthcare GmbH FIRST 2022-12… "" <df> TRUE <df>
4 2022-12-20 Vk_20221216_19 Medicvent AB FIRST 2022-12… "" <df> TRUE <df>
5 2022-12-20 Vk_20221213_25 Macopharma FIRST 2022-12… "" <df> TRUE <df>
6 2022-12-20 Vk_20221208_26 Spiegelberg GmbH & Co. KG FIRST 2022-12… "" <df> TRUE <df>
7 2022-12-06 Vk_20221201_21 Fujifilm Corporation UPDATE 2022-12… "Rewor… <df> TRUE <df>
8 2022-12-20 Vk_20221216_15 Maquet Critical Care AB FIRST 2022-12… "" <df> TRUE <df>
9 2022-12-20 Vk_20221216_17 Siemens Healthcare GmbH FIRST 2022-12… "" <df> TRUE <df>
10 2022-12-20 Vk_20221215_03 custo med GmbH FIRST 2022-12… "" <df> TRUE <df>
# … with 27 more rows, and abbreviated variable names ¹statusDatum, ²begruendung, ³freigeschaltet,
# ⁴documents
# ℹ Use `print(n = ...)` to see more rows
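The `documents` column above is a list column with one data frame per publication, so the number of documents per publication can be read off it. A minimal base R sketch, assuming the result has been stored in a variable; `count_documents` is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: count the documents attached to each publication.
# `publications` is assumed to be the tibble returned by the request above,
# with a `swissmedicRef` column and a `documents` list column.
count_documents <- function(publications) {
  # NROW() gives the row count of a data frame and 0 for an empty list,
  # so publications without documents are counted as 0.
  counts <- vapply(publications$documents, NROW, integer(1))
  setNames(counts, publications$swissmedicRef)
}

# Usage (not run here, needs the result of the request above):
# count_documents(result)
```

The counts are named by `swissmedicRef`, which makes it easy to look up how many document links exist for a given reference.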
With a search term instead of a date range:
"https://fsca.swissmedic.ch/mep/api/publications/search?pageNumber=0&sortingProperty=PUBLICATION_DATE&direction=DESC" %>%
request() %>%
req_body_json(list(
fromDate = NULL,
toDate = NULL,
queryTerm = "Vk_20220224_16",
onlyUpdates = "false"
)) %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
pluck("content") %>%
as_tibble() %>%
unnest(everything())
# A tibble: 3 × 16
publikatio…¹ swiss…² herst…³ status statu…⁴ begru…⁵ hande…⁶ sn lot swVer…⁷ model besch…⁸ freig…⁹ title
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
1 2022-03-07 Vk_202… Siemen… FIRST 2022-0… "" Artis … "" "" "" "" MD: St… TRUE DE-1
2 2022-03-07 Vk_202… Siemen… FIRST 2022-0… "" Artis Q "" "" "" "" MD: St… TRUE FR-1
3 2022-03-07 Vk_202… Siemen… FIRST 2022-0… "" Artis … "" "" "" "" MD: St… TRUE IT-1
# … with 2 more variables: language <chr>, version <chr>, and abbreviated variable names ¹publikationsDatum,
# ²swissmedicRef, ³hersteller, ⁴statusDatum, ⁵begruendung, ⁶handelsname, ⁷swVersion, ⁸beschreibungKlasse,
# ⁹freigeschaltet
# ℹ Use `colnames()` to see all variable names
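The two requests above differ only in their JSON body, so they could be wrapped in one function. This is a rough sketch rather than a tested client: `search_body` and `search_publications` are hypothetical names, and the `page` argument simply fills the `pageNumber` query parameter that already appears in the URL.

```r
# Hypothetical helper: build the JSON body used in the requests above.
# NULL entries stay in the list and are sent as JSON null.
search_body <- function(query_term = NULL, from_date = NULL, to_date = NULL) {
  list(
    fromDate = from_date,
    toDate = to_date,
    queryTerm = query_term,
    onlyUpdates = "false"
  )
}

# Hypothetical wrapper around the search endpoint (needs the httr2 package).
search_publications <- function(..., page = 0) {
  req <- httr2::request("https://fsca.swissmedic.ch/mep/api/publications/search")
  req <- httr2::req_url_query(
    req,
    pageNumber = page,
    sortingProperty = "PUBLICATION_DATE",
    direction = "DESC"
  )
  req <- httr2::req_body_json(req, search_body(...))
  resp <- httr2::req_perform(req)
  # The rows shown above live in the "content" element of the response.
  httr2::resp_body_json(resp, simplifyVector = TRUE)$content
}

# Usage (not run here, performs a network request):
# search_publications(query_term = "Vk_20220224_16")
# search_publications(from_date = "2022-12-04", to_date = "2022-12-20")
```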
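The download links of the documents can be built like this and then looped or mapped over to download them automatically (number_of_documents is the number of documents attached to the publication, 3 in this example):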
str_c("https://fsca.swissmedic.ch/mep/api/publications/", "Vk_20220224_16",
"/documents/", 0:(number_of_documents - 1))
[1] "https://fsca.swissmedic.ch/mep/api/publications/Vk_20220224_16/documents/0"
[2] "https://fsca.swissmedic.ch/mep/api/publications/Vk_20220224_16/documents/1"
[3] "https://fsca.swissmedic.ch/mep/api/publications/Vk_20220224_16/documents/2"
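Looping over those links could look roughly like the sketch below. `document_urls` and `download_documents` are hypothetical helper names, and the `.pdf` extension is an assumption, since the thread does not say which file type the endpoint returns.

```r
# Hypothetical helper: build the document URLs for one publication reference.
document_urls <- function(ref, n_docs) {
  if (n_docs < 1) {
    return(character(0))
  }
  paste0(
    "https://fsca.swissmedic.ch/mep/api/publications/", ref,
    "/documents/", seq_len(n_docs) - 1
  )
}

# Hypothetical helper: save every document of a publication to dest_dir
# (needs the httr2 package). The ".pdf" default is an assumption.
download_documents <- function(ref, n_docs, dest_dir = ".", ext = ".pdf") {
  urls <- document_urls(ref, n_docs)
  paths <- file.path(dest_dir, paste0(ref, "_", seq_along(urls) - 1, ext))
  for (i in seq_along(urls)) {
    # req_perform(path = ...) writes the response body straight to disk.
    httr2::req_perform(httr2::request(urls[i]), path = paths[i])
  }
  invisible(paths)
}

# Usage (not run here, performs network requests):
# download_documents("Vk_20220224_16", 3, dest_dir = tempdir())
```

Keeping the URL construction in its own function makes it possible to inspect the links before downloading anything.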