I want to scrape the following webpage (scraping is allowed):
https://www.bisafans.de/pokedex/listen/numerisch.php
The aim is to extract a table like the following:
number | name | type1 | type2 |
---|---|---|---|
001 | Bisasam | Pflanze | Gift |
002 | ... | ... | ... |
I was able to scrape the number and name columns, but I have trouble extracting the types, since they are only present as the alt text of an image:
<img src="https://media.bisafans.de/f630aa6/typen/pflanze.png" alt="Pflanze">
How can I extract the value of the alt attribute? I already tried extracting the whole table, which only returns the numbers and names. I also tried html_attr(), but that didn't work either.
Does anyone know how I can achieve this?
CodePudding user response:
First read in the html:
library(rvest)
res <- read_html('https://www.bisafans.de/pokedex/listen/numerisch.php')
Now extract the table:
tab <- res %>% html_table() %>% `[[`(1)
Get rid of the ??? entries at the bottom of the table, which don't have any Typen images:
tab <- tab[tab[[2]] != '???', ]
Use XPath to get the nodes containing the first image in each Typen cell and extract their alt attribute, then insert the result into the Typen column of tab (note that this keeps only the first type of each Pokémon):
tab$Typen <- res %>% html_nodes(xpath = "//td/a[1]/img") %>% html_attr('alt')
This gives you:
tab
#> # A tibble: 908 x 3
#> Nr. Pokémon Typen
#> <int> <chr> <chr>
#> 1 1 Bisasam Pflanze
#> 2 2 Bisaknosp Pflanze
#> 3 3 Bisaflor Pflanze
#> 4 4 Glumanda Feuer
#> 5 5 Glutexo Feuer
#> 6 6 Glurak Feuer
#> 7 7 Schiggy Wasser
#> 8 8 Schillok Wasser
#> 9 9 Turtok Wasser
#> 10 10 Raupy Kaefer
#> # ... with 898 more rows
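If both types are needed, one option is to read the alt attributes row by row, so that a Pokémon with a single type gets NA in the second column. The following is only a minimal sketch on a small, made-up HTML snippet that imitates the structure described in the question (linked images inside the third cell); the snippet, the column names, and the image file names are assumptions, not the real page:

```r
library(rvest)

# Hypothetical two-row table imitating the page structure from the question
page <- read_html('
  <table>
    <tr><th>Nr.</th><th>Name</th><th>Typen</th></tr>
    <tr><td>001</td><td>Bisasam</td>
        <td><a href="#"><img src="pflanze.png" alt="Pflanze"></a>
            <a href="#"><img src="gift.png" alt="Gift"></a></td></tr>
    <tr><td>004</td><td>Glumanda</td>
        <td><a href="#"><img src="feuer.png" alt="Feuer"></a></td></tr>
  </table>')

# All rows except the header row
rows <- html_nodes(page, "tr")[-1]

# For each row, collect the alt text of every image in that row
types <- lapply(rows, function(r) html_attr(html_nodes(r, "img"), "alt"))

result <- data.frame(
  number = html_text(html_node(rows, "td:nth-child(1)"), trim = TRUE),
  name   = html_text(html_node(rows, "td:nth-child(2)"), trim = TRUE),
  # x[2] is NA when a row has only one image
  type1  = vapply(types, function(x) x[1], character(1)),
  type2  = vapply(types, function(x) x[2], character(1))
)

result
```

Working per row keeps the types aligned with their Pokémon even when the number of images differs between rows.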
CodePudding user response:
Here is an alternative. This wasn't easy, as rvest only extracts text by default and this is hard-coded into the function. But since we know exactly how the table should look, we can iterate over the row XML nodes and place each item into a column:
library(rvest)
library(tidyverse)
# read html
html <- read_html("https://www.bisafans.de/pokedex/listen/numerisch.php")
html %>%
  # select tr nodes aka rows
  html_nodes(".table tr") %>%
  # map_df applies the function to each row and binds the results into
  # one data frame
  map_df(function(x) {
    # first extract the text
    text <- html_text(x, trim = TRUE)
    # this comes out as one string, so let's split it into cells
    text <- strsplit(text, "\\n")[[1]]
    # next extract the alt descriptions
    type <- html_nodes(x, "img") %>% html_attr("alt")
    # if there is more than one, collapse them into one string,
    # removing empty ones
    type <- paste0(type[type != ""], collapse = ", ")
    # combine text and alt into a vector
    out <- c(text, type[type != ""])
    # transform it to a data frame
    tibble(
      Nr = out[1],
      Pokemon = out[2],
      Typen = out[3]
    )
  }) %>%
  # drop the first row, which comes from the table header
  slice(-1)
#> # A tibble: 912 × 3
#> Nr Pokemon Typen
#> <chr> <chr> <chr>
#> 1 001 Bisasam Pflanze, Gift
#> 2 002 Bisaknosp Pflanze, Gift
#> 3 003 Bisaflor Pflanze, Gift
#> 4 004 Glumanda Feuer
#> 5 005 Glutexo Feuer
#> 6 006 Glurak Feuer, Flug
#> 7 007 Schiggy Wasser
#> 8 008 Schillok Wasser
#> 9 009 Turtok Wasser
#> 10 010 Raupy Kaefer
#> # … with 902 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)
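To get from the collapsed Typen string to the type1 / type2 layout asked for in the question, the column can be split on the separator. This is a minimal sketch assuming the ", " separator used above and at most two types per Pokémon; the small data frame is toy data in the same shape, and the names pokedex and wide are made up for illustration:

```r
library(tidyr)

# Toy data in the shape produced above: types collapsed into one string
pokedex <- data.frame(
  Nr      = c("001", "004", "006"),
  Pokemon = c("Bisasam", "Glumanda", "Glurak"),
  Typen   = c("Pflanze, Gift", "Feuer", "Feuer, Flug")
)

# Split Typen into two columns; fill = "right" puts NA into type2
# when there is only one type
wide <- separate(pokedex, Typen, into = c("type1", "type2"),
                 sep = ", ", fill = "right")

wide
```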
CodePudding user response:
After many trials, I was able to get both Typen for a given Pokémon.
First we write a function that goes to the XPath of each Pokémon's row and gets the necessary info.
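library(rvest)
library(tidyverse)
# res is the parsed page, as in the first answer
res = read_html('https://www.bisafans.de/pokedex/listen/numerisch.php')
f1 = function(n){
  # XPath of the n-th table row
  xx = paste0('//*[@id="content"]/section/div/table/tbody/tr[', n, ']')
  # link text in the row, keeping only the non-empty name
  Pokémon = res %>% html_nodes(xpath = xx) %>% html_nodes('a') %>% html_text() %>% str_subset(". ")
  # alt text of every type image in the row
  type = res %>% html_nodes(xpath = xx) %>% html_nodes('a') %>%
    html_nodes('img') %>% html_attr('alt')
  dat = data.frame(Pokémon, type)
  return(dat)
}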
Then we use lapply to go through all the row XPaths and get a list. Because of the ??? entries, we use tryCatch to skip the rows that fail.
df = lapply(1:912, function(x){
  tryCatch(f1(x), error = function(e) NA)
})
# convert to data frame
df = do.call(rbind.data.frame, df)
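The skip-on-error pattern can be tried in isolation. The sketch below is not the code above but a variant under stated assumptions: parse_row is a hypothetical stand-in for f1 that fails on some rows, and the handler returns NULL instead of NA so that failed rows can be filtered out explicitly before binding:

```r
# Hypothetical stand-in for f1(): fails on every third row,
# imitating rows that have no type images
parse_row <- function(n) {
  if (n %% 3 == 0) stop("row has no type images")
  data.frame(row = n, type = "Feuer")
}

# Return NULL for rows that raise an error
pieces <- lapply(1:6, function(n) {
  tryCatch(parse_row(n), error = function(e) NULL)
})

# Drop the failed rows, then bind the remaining data frames
pieces <- Filter(Negate(is.null), pieces)
out <- do.call(rbind, pieces)

out
```

Returning NULL rather than NA is a design choice here: it avoids placeholder entries ending up as extra rows in the combined data frame.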
Finally, to get the desired output we use pivot_wider:
df %>%
  group_by(Pokémon) %>%
  mutate(n = row_number()) %>%
  pivot_wider(
    names_from = "n",
    names_prefix = "type_",
    values_from = "type") %>%
  select_if(function(x) !(all(is.na(x)) | all(x == "")))
# A tibble: 909 x 3
# Groups: Pokémon [909]
Pokémon type_1 type_2
<chr> <chr> <chr>
1 Bisasam Pflanze Gift
2 Bisaknosp Pflanze Gift
3 Bisaflor Pflanze Gift
4 Glumanda Feuer NA
5 Glutexo Feuer NA