Scraping names (values) from attributes with rvest in R

I want to scrape the following webpage (scraping it is allowed):

https://www.bisafans.de/pokedex/listen/numerisch.php

The aim is to extract a table like the following:

number	name	type1	type2
001	Bisasam	Pflanze	Gift
002	...	...	...

I was able to scrape the number and name columns of the table, but I have trouble extracting the types, since they are only present as the alt text of an image:

<img src="https://media.bisafans.de/f630aa6/typen/pflanze.png" alt="Pflanze">

How can I extract the value of the alt attribute? I already tried extracting the whole table, which only returns the numbers and names. I also tried html_attr(), but could not get that to work either.

Does anyone know how I can achieve this?

CodePudding user response：

First, read in the HTML:

library(rvest)

res <- read_html('https://www.bisafans.de/pokedex/listen/numerisch.php')

Now extract the table:

tab <- res %>% html_table() %>% `[[`(1)

Get rid of the ??? entries at the bottom of the table that don't have any Typen images:

tab <- tab[tab[[2]] != '???', ]

Use XPath to get the first image in each Typen cell and extract its alt attribute, then insert the result into the Typen column of tab. Note that a[1] selects only the first link in each cell, so only the first type of each Pokémon is captured:

tab$Typen <- res %>% html_nodes(xpath = "//td/a[1]/img") %>% html_attr('alt')
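To see what the XPath selects without hitting the live site, here is a minimal sketch on a hand-written HTML snippet. The snippet is an assumption: a simplified, hypothetical stand-in for two rows of the real table, not a copy of the page's markup.

```r
library(rvest)

# Hypothetical, simplified stand-in for two rows of the real table
snippet <- '
<table>
  <tr><th>Nr.</th><th>Pokemon</th><th>Typen</th></tr>
  <tr><td>001</td><td><a href="#">Bisasam</a></td>
      <td><a href="#"><img src="pflanze.png" alt="Pflanze"></a>
          <a href="#"><img src="gift.png" alt="Gift"></a></td></tr>
  <tr><td>004</td><td><a href="#">Glumanda</a></td>
      <td><a href="#"><img src="feuer.png" alt="Feuer"></a></td></tr>
</table>'

doc <- read_html(snippet)

# //td/a[1]/img selects the image inside the first link of each cell.
# The name cell's first link contains no image, so only the first
# type image of each row is matched.
first_type <- doc %>%
  html_nodes(xpath = "//td/a[1]/img") %>%
  html_attr("alt")

first_type
```

On this snippet the second type of Bisasam ("Gift") is not returned, which matches the single Typen column in the output of this answer.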

This gives you:

tab
#> # A tibble: 908 x 3
#>      Nr. Pokémon   Typen  
#>    <int> <chr>     <chr>  
#>  1     1 Bisasam   Pflanze
#>  2     2 Bisaknosp Pflanze
#>  3     3 Bisaflor  Pflanze
#>  4     4 Glumanda  Feuer  
#>  5     5 Glutexo   Feuer  
#>  6     6 Glurak    Feuer  
#>  7     7 Schiggy   Wasser 
#>  8     8 Schillok  Wasser 
#>  9     9 Turtok    Wasser 
#> 10    10 Raupy     Kaefer 
#> # ... with 898 more rows

CodePudding user response：

Here is an alternative. This wasn't easy, because rvest's table extraction only returns cell text, and that behavior is hard-coded into the function. But since we know exactly how the table should look, we can iterate over the row nodes and place each item into a column:

library(rvest)
library(tidyverse)
# read html
html <- read_html("https://www.bisafans.de/pokedex/listen/numerisch.php")

html %>% 
  # select tr nodes aka rows
  html_nodes(".table tr") %>% 
  # map_df applies the function to each row and binds the results into
  # one data frame
  map_df(function(x) {

    # first extract text
    text <- html_text(x, trim = TRUE)
    # this comes out as one string so let's split it into cells
    text <- strsplit(text, "\\n")[[1]]

    # next extract alt descriptions
    type <- html_nodes(x, "img") %>% html_attr("alt")
    # if there is more than one, collapse them into one string,
    # removing empty ones
    type <- paste0(type[type != ""], collapse = ", ")

    # combine text and alt into a vector
    out <- c(text, type[type != ""])
    # transform it to a data frame
    tibble(
      Nr = out[1],
      Pokemon   = out[2],
      Typen = out[3]
    )
  }) %>% 
  slice(-1)
#> # A tibble: 912 × 3
#>    Nr    Pokemon    Typen        
#>    <chr> <chr>      <chr>        
#>  1 001    Bisasam   Pflanze, Gift
#>  2 002    Bisaknosp Pflanze, Gift
#>  3 003    Bisaflor  Pflanze, Gift
#>  4 004    Glumanda  Feuer        
#>  5 005    Glutexo   Feuer        
#>  6 006    Glurak    Feuer, Flug  
#>  7 007    Schiggy   Wasser       
#>  8 008    Schillok  Wasser       
#>  9 009    Turtok    Wasser       
#> 10 010    Raupy     Kaefer       
#> # … with 902 more rows

Created on 2022-03-25 by the reprex package (v2.0.1)
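If separate type1 and type2 columns are wanted, as in the question, rather than one comma-joined string, the same row-wise idea can be sketched as follows. The HTML snippet and the column names number, name, type1 and type2 are assumptions made for illustration. The real page's markup may differ, so the selectors may need adjusting.

```r
library(rvest)

# Hypothetical, simplified stand-in for two rows of the real table
snippet <- '
<table>
  <tr><th>Nr.</th><th>Pokemon</th><th>Typen</th></tr>
  <tr><td>001</td><td><a href="#">Bisasam</a></td>
      <td><a href="#"><img src="pflanze.png" alt="Pflanze"></a>
          <a href="#"><img src="gift.png" alt="Gift"></a></td></tr>
  <tr><td>004</td><td><a href="#">Glumanda</a></td>
      <td><a href="#"><img src="feuer.png" alt="Feuer"></a></td></tr>
</table>'

doc <- read_html(snippet)

# Rows that contain data cells (the header row only has <th> cells)
rows <- html_nodes(doc, xpath = "//tr[td]")

# For each row, collect the alt text of every image in that row
types <- lapply(rows, function(row) html_attr(html_nodes(row, "img"), "alt"))

result <- data.frame(
  # html_node() returns exactly one match (or a missing node) per row
  number = html_text(html_node(rows, xpath = "./td[1]"), trim = TRUE),
  name   = html_text(html_node(rows, xpath = "./td[2]"), trim = TRUE),
  # indexing past the end of a vector yields NA, so single-type rows
  # get NA in type2
  type1  = vapply(types, function(x) x[1], character(1)),
  type2  = vapply(types, function(x) x[2], character(1)),
  stringsAsFactors = FALSE
)

result
```

Keeping the types in a list per row avoids the later reshaping step, at the cost of assuming at most two types per row.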

CodePudding user response：

After many attempts, I was able to get both types (Typen) for a given Pokémon.

First, we write a function that takes the XPath of a given table row and extracts the necessary information.

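library(rvest)
library(tidyverse)

res <- read_html('https://www.bisafans.de/pokedex/listen/numerisch.php')

# return the name and type(s) found in the n-th table row
f1 = function(n){
  xx = paste0('//*[@id="content"]/section/div/table/tbody/tr[', n, ']')

  # text of the links in the row, filtered with the pattern ". "
  Pokémon = res %>% html_nodes(xpath = xx) %>% html_nodes('a') %>% html_text() %>% str_subset(". ")

  # alt text of every image inside a link in the row
  type = res %>% html_nodes(xpath = xx) %>% html_nodes('a') %>% 
    html_nodes('img') %>% html_attr('alt')

  dat = data.frame(Pokémon, type)
  return(dat)
}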

Then we use lapply to go through all row XPaths and collect the results in a list. Because of the ??? rows, we use tryCatch to skip them.

df = lapply(1:912, function(x){ 
  tryCatch(f1(x), error=function(e) NA)
  }
)
# convert to a data frame
df = do.call(rbind.data.frame, df)

Finally, to get the desired output, we use pivot_wider:

df %>% group_by(Pokémon) %>% 
  mutate(n = row_number()) %>% 
  pivot_wider(
    names_from = "n", 
    names_prefix = "type_", 
    values_from = "type") %>% 
  select_if(function(x) !(all(is.na(x)) | all(x=="")))

# A tibble: 909 x 3
# Groups:   Pokémon [909]
   Pokémon   type_1  type_2
   <chr>     <chr>   <chr> 
 1 Bisasam   Pflanze Gift  
 2 Bisaknosp Pflanze Gift  
 3 Bisaflor  Pflanze Gift  
 4 Glumanda  Feuer   NA    
 5 Glutexo   Feuer   NA