Parsing rvest output from an unstructured infobox

Time:05-18

I am attempting to extract data from a Fandom wiki page using the rvest package in R. However, I am running into several issues because the infobox is not structured as an HTML table. Please see below for my attempts at dealing with this issue:

library(tidyverse)
library(data.table)
library(rvest)
library(httr)

url <- c("https://starwars.fandom.com/wiki/Anakin_Skywalker")

#See here that the infobox information does not appear when checking for HTML tables in the page
df <- read_html(url) %>%
  html_table()

#So now just extract data using the CSS selector
df <- read_html(url) %>%
  html_element("aside") %>%
  html_text2()

The second attempt does succeed at extracting the raw data, but the result is not easy to reshape into a clean data frame. So I then tried to extract each element of the infobox individually, which might be easier to clean and structure into a data frame. However, when I attempt to do so using XPath, I get an empty result:

df <- read_html(url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/aside/section[1]') %>%
  html_text2() 

So I suppose my question is primarily: does anyone know of a good way to automatically extract the infobox in a data-frame-friendly format? If not, would someone be able to point me towards why my attempt to extract each panel individually is not working?

CodePudding user response:

If you target the div.pi-data directly, you could do something like this:

read_html(url) %>%
  html_elements("div.pi-data") %>%
  # build one small tibble per infobox row, splitting multi-line values
  map(~ tibble(
    label = html_elements(.x, ".pi-data-label") %>% html_text2(),
    text = html_elements(.x, ".pi-data-value") %>% html_text2() %>% strsplit(split = "\n")
  ) %>% unnest(text)) %>%
  bind_rows()

Output:

# A tibble: 29 x 2
   label      text                                                              
   <chr>      <chr>                                                             
 1 Homeworld  Tatooine[1]                                                       
 2 Born       41 BBY,[2] Tatooine[3]                                            
 3 Died       4 ABY,[4]DS-2 Death Star II Mobile Battle Station, Endor system[5]
 4 Species    Human[1]                                                          
 5 Gender     Male[1]                                                           
 6 Height     1.88 meters,[1] later 2.03 meters (6 ft, 8 in) in armor[6]        
 7 Mass       120 kilograms in armor[7]                                         
 8 Hair color Blond,[8] light[9] and dark[10]                                   
 9 Eye color  Blue,[11] later yellow (dark side)[12]                            
10 Skin color Light,[11] later pale[5]                                          
# ... with 19 more rows
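
If you also want to drop the footnote markers such as [1] from the values, here is a minimal sketch of the same approach run on an inline HTML snippet instead of the live page. The snippet is a hypothetical stand-in for the infobox markup (the real page may be structured differently), and the citation-stripping regex is my own addition, not something the page or rvest provides:

```r
library(rvest)
library(dplyr)
library(purrr)
library(tibble)
library(tidyr)
library(stringr)

# Hypothetical stand-in for the infobox markup; the real page may differ.
html <- '
<aside class="portable-infobox">
  <div class="pi-item pi-data">
    <h3 class="pi-data-label">Homeworld</h3>
    <div class="pi-data-value">Tatooine<sup>[1]</sup></div>
  </div>
  <div class="pi-item pi-data">
    <h3 class="pi-data-label">Eye color</h3>
    <div class="pi-data-value">Blue<sup>[11]</sup><br>Yellow<sup>[12]</sup></div>
  </div>
</aside>'

page <- minimal_html(html)

infobox <- page %>%
  html_elements("div.pi-data") %>%
  # one tibble per infobox row; html_text2() turns <br> into "\n"
  map(~ tibble(
    label = html_text2(html_element(.x, ".pi-data-label")),
    text  = strsplit(html_text2(html_element(.x, ".pi-data-value")), "\n")
  )) %>%
  bind_rows() %>%
  # expand multi-line values into one row per value
  unnest(text) %>%
  # assumption: citations always look like [digits]
  mutate(text = str_squish(str_remove_all(text, "\\[\\d+\\]")))

print(infobox)
```

Working from a small inline snippet like this also makes it easier to check the selectors before pointing them at the real URL with read_html(url).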