Scraping <li> item using rvest

I'd like to scrape https://www.deutsche-biographie.de/. Specifically, I'm interested in the following information about each individual:

Name
Year of birth
Year of death
Profession
Place of birth ('geburt' in source code) and coordinates
Place of death ('tod' in source code) and coordinates
Places of activity ('wirk' in source code) and coordinates

With the code below, I scraped name, year of birth, year of death, and profession.

library(rvest)
library(dplyr)

page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("#secondColumn p") %>% html_text()
result = data.frame(name, information, stringsAsFactors = FALSE)

#manipulate data in columns
result$yearofbirth = sub("(^[^-]+)-.*", "\\1", result$information) #extract characters before dash
result$yearofdeath = sub(',.*$','', result$information)
result$yearofdeath = sub('.*-','', result$yearofdeath) #extract characters after dash
result$profession = sub("^.*?,", "", result$information) #extract characters after comma
result$profession = trimws(result$profession, whitespace = "[ \t\r\n]") #trim leading and trailing white space
result$information = NULL

However, I'm struggling to scrape the places of birth/death/activity from the <li> element. The source code looks as follows: data-orte holds the places of birth/death/activity (geburt/tod/wirk) and data-name holds the name of the individual.

 <li class="media treffer-liste-elem" id="treffer-sfz55763" data-orte="[email protected],9.6596678@geburt;[email protected],9.6596678@wirk;[email protected],10.1371858@wirk;[email protected],11.6399609@wirk;[email protected],12.109015599915@wirk;Frankfurt/[email protected],14.5544166@wirk;[email protected],9.54054973309832@wirk;[email protected],11.8767269@wirk;[email protected],11.3430347@wirk;[email protected],7.5969912@wirk;Kö[email protected],20.5105165@wirk;[email protected],18.6542829@wirk;[email protected],14.4212126@wirk;[email protected],4.9001115@wirk;[email protected],8.6805975@wirk;[email protected],12.109015599915@wirk;[email protected],11.6399609@tod" data-name="Maier, Michael">

I would be very grateful for any hint on how to scrape the places! Best, Natalie

CodePudding user response：

I hope this solution helps:

library(rvest)
library(stringr) # provides str_split()

page %>% 
  html_elements("#secondColumn > ul") %>% 
  html_children() %>% html_attr("data-orte") %>% 
  str_split(";")
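The split above returns raw strings such as place@lat,lon@type. As a minimal sketch of how one such string could be broken down further, the base R helper below turns it into a data frame. The helper name parse_orte and the sample value are made up for illustration, and the sketch assumes that place names never contain ; or @.

```r
# Hypothetical helper (not part of rvest): parse one data-orte string
# of the assumed form "place@lat,lon@type;place@lat,lon@type;..."
parse_orte <- function(orte) {
  entries <- strsplit(orte, ";", fixed = TRUE)[[1]]   # one entry per place
  parts   <- strsplit(entries, "@", fixed = TRUE)     # place, "lat,lon", type
  coords  <- strsplit(vapply(parts, `[`, character(1), 2), ",", fixed = TRUE)
  data.frame(
    place = vapply(parts, `[`, character(1), 1),
    lat   = as.numeric(vapply(coords, `[`, character(1), 1)),
    lon   = as.numeric(vapply(coords, `[`, character(1), 2)),
    type  = vapply(parts, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

# Made-up sample value in the same shape as the attribute
sample_orte <- "Rendsburg@54.30,9.66@geburt;Prag@50.08,14.42@wirk;Magdeburg@52.13,11.64@tod"
places_df <- parse_orte(sample_orte)

# Keep only the place of birth
places_df[places_df$type == "geburt", ]
```

Applied with lapply() over the full vector of data-orte values, this would give one data frame per person.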

CodePudding user response：

Another option to achieve your desired result may look like so:

The first step is similar to the solution proposed by @Kafe: get the place information from the data-orte attribute and split it by ; to get a list of places for each person.
In the second step, I use lapply to put the places of birth, death and activity into separate columns of your result data frame.
In the third step, I make heavy use of tidyr::extract, which makes it easy to pull multiple pieces of information out of a string and put them into separate columns in one step.

Note: I also used a different approach to extract the years of birth and death.
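To illustrate the idea behind the year extraction in isolation, here is a small base R sketch that applies the same kind of regular expression with two capture groups. The input strings are made up and only assume that the text contains a four-digit year, a dash followed by whitespace, and a second four-digit year.

```r
# Made-up examples of the "years" part of the information text
info <- c("1568 - 1622", "um 1474 - 1548")

# regexec() returns the full match plus one element per capture group
m <- regmatches(info, regexec("^.*?(\\d{4}).*?-\\s(\\d{4})", info, perl = TRUE))

# Elements 2 and 3 are the captured birth and death years
years <- do.call(rbind, lapply(m, function(x) as.integer(x[2:3])))
colnames(years) <- c("year_of_birth", "year_of_death")
years
```

Entries that do not match the pattern end up as NA in both columns rather than raising an error.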

library(rvest)
library(dplyr)

page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("div.media-body p") %>% html_text()
result = data.frame(name, information)
result$information <- result$information %>% trimws() %>% strsplit(split = ", \\n") %>% lapply(trimws)
result <- tidyr::unnest_wider(result, information) %>%
  rename(years = 2, profession = 3) %>% 
  tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")

places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")

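# Pick the first matching place per person, or NA if none is recorded, so the column length always matches nrow(result)
pick_place <- function(x, pattern) { hit <- x[grepl(pattern, x)]; if (length(hit) > 0) hit[1] else NA_character_ }
result$place_of_birth <- vapply(places, pick_place, character(1), pattern = "@geburt$")
result$place_of_death <- vapply(places, pick_place, character(1), pattern = "@tod$")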
result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])

result <- result %>% 
  tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>% 
  tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")

result
#> # A tibble: 10 × 9
#>    name   year_of_birth year_of_death profession place_of_birth place_of_birth_…
#>    <chr>  <chr>         <chr>         <chr>      <chr>          <chr>           
#>  1 Meier… 1718          1777          Philosoph  Ammendorf bei… 51.4265204,11.9…
#>  2 Meyer… 1772          1849          Jurist; B… Frankfurt/Main 50.1432793,8.68…
#>  3 Meier… 1809          1898          Bremer Ka… Bremen         53.0758099,8.80…
#>  4 Major… 1502          1574          lutherisc… Nürnberg       49.4538501,11.0…
#>  5 Meyer… 1810          1874          schweizer… Sursee Kanton… 47.1774826,8.10…
#>  6 Maier… 1568          1622          Alchemist… Rendsburg      54.3012661,9.65…
#>  7 Meier… 1692          1745          Jurist; A… Bayreuth       49.9427202,11.5…
#>  8 Mejer… 1818          1893          Jurist; P… Zellerfeld (H… 51.804126,10.33…
#>  9 Meyer… 1474          1548          Bürgermei… Basel          47.5429886,7.59…
#> 10 Hirsc… 1770          1851          Mathemati… Friesack (Mit… 52.7395263,12.5…
#> # … with 3 more variables: place_of_death <chr>, place_of_death_coord <chr>,
#> #   place_of_activity <list>

Created on 2021-11-21 by the reprex package (v2.0.1)
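The places of activity remain a list column of raw strings in the result above. One possible way to get them into the same place/coordinate shape is to unnest the list column into one row per person and place and then split each string. The sketch below uses a toy stand-in for result; the names, places and coordinates in it are invented for illustration, and the column names place, lat and lon are my own choice.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `result` with made-up names and coordinates
result_demo <- tibble(
  name = c("Person A", "Person B"),
  place_of_activity = list(
    c("Hamburg@53.55,10.00@wirk", "Prag@50.08,14.42@wirk"),
    c("Basel@47.54,7.59@wirk")
  )
)

activity_long <- result_demo %>%
  tidyr::unnest_longer(place_of_activity) %>%   # one row per person and place
  tidyr::extract(
    place_of_activity,
    into = c("place", "lat", "lon"),
    regex = "^(.*?)@([^,@]+),([^,@]+)@wirk$",
    convert = TRUE                              # lat and lon become numeric
  )

activity_long
```

Keeping the activity places in a separate long table avoids a variable number of columns per person; it can be joined back to the main table by name if needed.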