I scrape information with rvest and store it in a data frame. All the information on various institutions and their context characteristics is stored in one string. It looks similar to JSON, but it isn't. I followed another Stack Overflow post but was not successful. I think string manipulation should do the job. In the end, "title", "street", "number", etc. should be variables, and each institution should be a row. Thank you very much.
library('tidyverse')
library('rvest')
library('stringr')
library('stringi')
library('jsonlite')
rubyhash <- "https://www.blutspenden.de/blutspendedienste/#" %>%
  read_html() %>%
  html_nodes("body") %>%
  html_nodes("script:first-of-type") %>%
  html_text() %>%
  as_tibble() %>%
  slice(1)
substr(rubyhash$value,1,150)
"\n var instituionsmap_data = '[{\"title\":\"Plasmazentrum Heidelberg\",\"street\":\"Hans-B\\u00f6ckler-Stra\\u00dfe\",\"number\":\"2A\",\"zip\":\"69115\",\"city\":\""
rubyhash$json <- str_replace(rubyhash$value, "var instituionsmap_data =", "")
rubyhash$json <- trimws(rubyhash$json)
substr(rubyhash$json,1,150)
"'[{\"title\":\"Plasmazentrum Heidelberg\",\"street\":\"Hans-B\\u00f6ckler-Stra\\u00dfe\",\"number\":\"2A\",\"zip\":\"69115\",\"city\":\"Heidelberg\",\"phone\":\"06221 89466960"
fromJSON(rubyhash$json)
CodePudding user response:
The data you are trying to parse is a JavaScript string literal holding an array of JSON objects, each one containing the equivalent of a data frame row. As well as removing the JavaScript variable assignment at the start, this approach splits the array into its component JSON strings before parsing:
rubyhash$value %>%
  # Strip the JavaScript assignment and the opening "[{" of the array
  str_replace("var instituionsmap_data = '\\[\\{", "") %>%
  str_replace("\\}\\]';\n", '') %>% # Removes the javascript chars at the end
  strsplit('\\},\\{') %>% # Split into component json strings
  getElement(1) %>%
  # Restore the braces that the split removed from each object
  sapply(function(x) paste0('{', x, '}'), USE.NAMES = FALSE) %>%
  # Parse each object into a one-row data frame
  lapply(function(x) as.data.frame(fromJSON(x))) %>%
  bind_rows() %>%
  as_tibble()
#> # A tibble: 195 x 14
#>    title street number zip   city  phone fax   email~1 email url   rekon~2   uid
#>    <chr> <chr>  <chr>  <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <int> <int>
#>  1 Plas~ Hans-~ "2A"   69115 Heid~ 0622~ ""    "info(~ java~ http~      48   567
#>  2 Plas~ Kamps~ "88 -~ 44137 Dort~ 0231~ ""    "info-~ java~ http~      16   568
#>  3 Plas~ Roteb~ "25"   70178 Stut~ 0711~ ""    "stutt~ java~ http~      16   571
#>  4 Plas~ K1 2   ""     68159 Mann~ 6211~ ""    ""      java~ http~     112   575
#>  5 DRK-~ Fried~ ""     68167 Mann~ 0621~ ""    ""      java~ www.~      49   359
#>  6 DRK-~ Gunze~ "35"   76530 Bade~ 0722~ ""    ""      java~ www.~      33   387
#>  7 DRK ~ Helmh~ ""     89081 Ulm   0731~ ""    ""      java~ www.~      49   389
#>  8 Blut~ Im Ne~ "305"  69120 Heid~ 0622~ ""    ""      java~ http~      49   400
#>  9 Blut~ Otfri~ ""     72076 Tübi~ 0707~ ""    "bluts~ java~ www.~      49   402
#> 10 Blut~ Diako~ ""     74523 Schw~ 0791~ ""    ""      java~ www.~      32   403
#> # ... with 185 more rows, 2 more variables: lat <chr>, lon <chr>, and
#> # abbreviated variable names 1: email_display, 2: rekonvaleszentenplasma
Created on 2022-09-01 with reprex v2.0.2
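As a possible variation on the splitting approach above, the whole array can sometimes be handed to fromJSON() in one go once the surrounding JavaScript is stripped. The sketch below runs on a miniature stand-in for the script text: the first record reuses the fields shown in the question, while the second record and the variable names raw, json, and institutions are invented for illustration. Whether this works on the live page is an assumption; it would fail if the string contains JavaScript-only escapes that are not valid JSON.

```r
library(jsonlite)

# Miniature stand-in for the scraped script text.
# The second record is hypothetical sample data, not taken from the site.
raw <- paste0(
  "\n var instituionsmap_data = '[",
  "{\"title\":\"Plasmazentrum Heidelberg\",",
  "\"street\":\"Hans-B\\u00f6ckler-Stra\\u00dfe\",",
  "\"number\":\"2A\",\"zip\":\"69115\"},",
  "{\"title\":\"Example Centre\",\"street\":\"Example Street\",",
  "\"number\":\"\",\"zip\":\"00000\"}",
  "]';\n"
)

# Keep everything from the first "[" to the last "]" (the match is greedy),
# which drops the variable assignment, the single quotes, and the semicolon.
json <- regmatches(raw, regexpr("\\[.*\\]", raw))

# A JSON array of flat objects is simplified to a data frame, one row per object.
institutions <- fromJSON(json)
institutions
```

If this parses cleanly on the real string, the result can be passed to as_tibble() just like the output of the split-and-bind version.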
CodePudding user response:
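Here is an alternative with simpler code. Instead of parsing the embedded JavaScript variable, it reads the fields directly from the HTML elements of the page: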
library(tidyverse)
library(rvest)
library(httr2)
page <- "https://www.blutspenden.de/blutspendedienste/" %>%
  request() %>%
  req_perform() %>%
  resp_body_html()
tibble(
  title = page %>%
    html_elements(".institutions__title") %>%
    html_text2(),
  location = page %>%
    html_elements(".institutions__location") %>%
    html_text2(),
  address = page %>%
    html_elements(".institutions__address") %>%
    html_text2(),
  phone = page %>%
    html_elements(".institutions__item") %>%
    map_chr(. %>%
      html_element(".institutions__phone") %>%
      html_text2),
  position = page %>%
    html_elements(".institutions__item") %>%
    map_chr(. %>%
      html_element(".institutions__position") %>%
      html_text2)
)
# A tibble: 200 x 5
   title                                         locat~1 address phone posit~2
   <chr>                                         <chr>   <chr>   <chr> <chr>
 1 Plasma Service Europe Aachen                  Aachen  "Alter~ 0241~ https:~
 2 Octapharma Plasma Aachen                      Aachen  "Peter~ 0241~ https:~
 3 Blutspendedienst der Uniklinik RWTH Aachen    Aachen  "Pauwe~ 0241~ www.uk~
 4 Haema Plasmaspendezentrum Augsburg            Augsbu~ "Phili~ 0821~ https:~
 5 Institut für Transfusionsmedizin und Hämost~  Augsbu~ "Steng~ 0821~ https:~
 6 DRK Blutspendedienst NSTOB Bad Fallingbostel  Bad Fa~ "Konra~ NA    https:~
 7 DRK-Blutspendedienst Bad Kreuznach            Bad Kr~ "Burgw~ 0671~ https:~
 8 Blutspendedienst OWL - Bad Oeynhausen HDZ NRW Bad Oe~ "Georg~ 0573~ https:~
 9 DRK-Blutspendedienst Bad Salzuflen            Bad Sa~ "Heldm~ NA    https:~
10 DRK-Blutspendedienst Baden-Baden              Baden-~ "Gunze~ 0722~ www.bl~
# ... with 190 more rows, and abbreviated variable names 1: location,
# 2: position
# i Use `print(n = ...)` to see more rows
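The phone and position columns above are looked up inside each ".institutions__item", which is presumably why missing phone numbers show up as NA instead of shifting later rows upwards. A minimal sketch of that per-item pattern follows; the markup is a hypothetical two-item page built with minimal_html(), and the class names are borrowed from the answer without any guarantee that they match the live site.

```r
library(rvest)
library(tibble)

# Hypothetical markup: the second item has no phone element.
page <- minimal_html('
  <div class="institutions__item">
    <span class="institutions__title">Centre A</span>
    <span class="institutions__phone">0123 456</span>
  </div>
  <div class="institutions__item">
    <span class="institutions__title">Centre B</span>
  </div>
')

# One node per institution; each field is then searched within its own node.
items <- html_elements(page, ".institutions__item")

result <- tibble(
  # html_element() (singular) returns exactly one result per item,
  # using a missing node where nothing matches, so columns stay aligned.
  title = items %>% html_element(".institutions__title") %>% html_text2(),
  phone = items %>% html_element(".institutions__phone") %>% html_text2()
)
result
```

Applying the same per-item lookup to every column, not only phone and position, would guard against misaligned rows if any institution lacks a title, location, or address element.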