How to properly identify a specific value to parse using rvest

Time:12-17

Dear Collective Wisdom

I'm struggling to use rvest to parse a table from https://www.1944.pl/powstancze-biogramy,ord,nazwisko,0,strona,1.html

I need to loop through all rows of the table and extract their values one by one, then move on to the next page and repeat.

I want to read the table values separately because I need a conditional step in the loop: for each row, if the value in the "Data urodzenia" column equals "-", the program should open the webpage corresponding to that row and extract a different value (labelled "Rocznik") instead.

For now, I'm having trouble getting rvest to read individual values from the table. I think I don't quite follow the idea of HTML selectors... I'm able to read the entire table (per page) using the ".museumTableRow" class selector in the following function:

library(rvest) 
library(tidyverse)

page <- read_html("https://www.1944.pl/powstancze-biogramy,ord,nazwisko,0,strona,1.html")
getPage <- function(html){
  html %>% 
    html_nodes(".museumTableRow") %>%      
    html_text() %>% 
    str_trim() %>%                       
    unlist()                             
}

Append_page <- getPage(page)

...but when I try to use selectors for specific cells of the table, I get an empty ("character(0)") response. I tried to find the relevant selectors by inspecting the page manually and by using the SelectorGadget plugin, as suggested by the package authors. The results seem odd (to me), e.g. for the first name in the "Nazwisko" column SelectorGadget suggests:

".footable-even:nth-child(1) .footable-first-column .museumTableRow"

so I also tried playing with them, but with no success. I guess I don't fully understand how it works. I would appreciate any suggestions on how to get rvest to read this table cell by cell and append the values from subsequent cells to a data.table.

I hope this is specific enough.

CodePudding user response:

This should work:

library(rvest)
library(stringr)
library(glue)

page <- read_html("https://www.1944.pl/powstancze-biogramy,ord,nazwisko,0,strona,1.html")

# One node per table row, plus the first link in each row
dat <- page %>% html_elements(css = "tbody tr")
txt <- dat %>% html_text()
hrefs <- dat %>% html_element("a") %>% html_attr("href")

# Split each row's text on newlines and drop the empty pieces,
# leaving one character vector of cell values per row
s <- lapply(seq_along(txt), function(i) trimws(strsplit(txt[i], split = "\\n")[[1]]))
out_txt <- t(sapply(s, function(x) x[x != ""]))

stem <- "https://www.1944.pl"
for (i in seq_len(nrow(out_txt))) {
  # Column 6 holds the birth date; "-" means it is missing
  if (out_txt[i, 6] == "-") {
    u <- paste0(stem, hrefs[i])
    h <- read_html(u)
    btxt <- h %>% html_elements(css = "div.biogram--info") %>% html_text()
    ind <- grep("Rocznik", btxt)
    if (length(ind) > 0) {
      btxt2 <- h %>% html_elements(css = glue("div.biogram--info:nth-child({ind[1] - 1})")) %>% html_text()
      # Extract the run of digits (the year); [1] guards against zero or several matches
      out_txt[i, 6] <- str_extract(btxt2, "\\d+")[1]
    } else {
      out_txt[i, 6] <- NA_character_
    }
  }
}
head(out_txt)
#             [,1]          [,2]         [,3]      [,4]            [,5]                [,6]         [,7]        
# [1,] "Abajew"      "Aleksander" "-"       "-"             "-"                 "1916-06-06" "-"         
# [2,] "Abakanowicz" "Piotr"      "-"       "-"             "\"Grey\""          "1890-06-21" "1948-06-01"
# [3,] "Abakanowicz" "Maria"      "-"       "-"             "\"Lena\""          "1901"       "-"         
# [4,] "Abczyńska"   "Alicja"     "Henryka" "sanitariuszka" "\"Ciocia Stasia\"" "1900-02-09" "1989-04-26"
# [5,] "Abczyńska"   "Janina"     "-"       "pielęgniarka"  "\"Julia\""         "1883-06-15" "1944-08-30"
# [6,] "Abczyński"   "Stanisław"  "-"       "-"             "\"Stefan\""        NA           "-"         

The code above grabs the text of each row and the href of the first <a> tag in that row. If the sixth column of the ith row is "-", it follows that link. If the linked page has an entry labelled "Rocznik", it extracts the year from it; otherwise it replaces the entry with a missing value.
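
If you would rather read the table cell by cell than split each row's text on newlines, something along these lines might be a starting point. It is only a sketch: the inline HTML is a made-up stand-in for the real page (the actual markup, column order and link paths may differ), and page_url() is a hypothetical helper that simply follows the URL pattern from the question.

```r
library(rvest)

# Hypothetical stand-in for one page of the table; the real markup may differ.
page <- minimal_html('
<table>
  <thead>
    <tr><th>Nazwisko</th><th>Pseudonim</th><th>Data urodzenia</th></tr>
  </thead>
  <tbody>
    <tr><td><a href="/bio/1.html">Abajew</a></td><td>-</td><td>1916-06-06</td></tr>
    <tr><td><a href="/bio/2.html">Abczynski</a></td><td>Stefan</td><td>-</td></tr>
  </tbody>
</table>')

rows <- html_elements(page, "tbody tr")

# Read each row cell by cell: one character vector of <td> texts per row.
cells <- lapply(rows, function(r) html_text2(html_elements(r, "td")))
tab <- do.call(rbind, cells)
colnames(tab) <- html_text2(html_elements(page, "thead th"))

# First link in each row (NA if a row has none).
hrefs <- html_attr(html_element(rows, "a"), "href")

# Rows whose birth date is missing and therefore need the detail page.
needs_lookup <- tab[, "Data urodzenia"] == "-"
hrefs[needs_lookup]

# Hypothetical helper: URL of the n-th listing page, following the
# pattern of the address in the question.
page_url <- function(n) {
  sprintf("https://www.1944.pl/powstancze-biogramy,ord,nazwisko,0,strona,%d.html", n)
}
page_url(2L)
```

Selecting "td" within each row keeps cells aligned with their columns even when a cell is empty, which the newline-splitting approach cannot guarantee. To cover several pages you could loop over page_url(n) and rbind the per-page matrices.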
