how to scrape text from an icon

I'm trying to scrape all the data from this website. There are icons over some of the competitors names indicating that the person was disqualified for being a 'no-show'.

I would like create a data frame with all the competitors while also specifying who was disqualified, but I'm running into two issues:

(1) trying to add the disclaimer next to the persons name produces the error cannot coerce class ‘"xml_nodeset"’ to a data.frame.

(2) trying to extract the text from just the icon (and not the competitor names) produces a blank data frame.

library(rvest); library(tidyverse)
html = read_html('https://web.archive.org/web/20220913034642/https://www.bjjcompsystem.com/tournaments/1869/categories/2053162')

dq = data.frame(winner = html %>%
                  html_nodes('.match-card__competitor--red') %>%
                  html_text(trim = TRUE),
                opponent = html %>%
                  html_nodes('hr  .match-card__competitor'),
                dq = html %>%
                  html_nodes('.match-card__disqualification') %>%
                  html_text())

CodePudding user response：

This approach generally works only on tabular data where you can be sure that the number of matches for each of those selectors are constant and order is also fixed. In your example you have:
127 matches for .match-card__competitor--red
127 matches for hr .match-card__competitor
14 matches for .match-card__disqualification (you get no results for this because you should use html_attr("title") for title attribute instead of html_text())

Basically you are trying to combine columns of different lengths into the same dataframe. Even if it would work, you'd just add DSQ for 14 first matches.

As you'd probably want to keep information about matched, participants, results and disqualifications instead of just having a list of participants, I'd suggest to work with a list of match cards, i.e. extract all required information from a single card while not breaking relations and then move to the next card.

My purrr is far from perfect, but perhaps something like this:

library(rvest)
library(magrittr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)

# helpers -----------------------------------------------------------------

# to keep matches with details (when/where) in header
is_valid_match <- function(element){
  return(length(html_elements(element, ".bracket-match-header")) > 0)
}
# detect winner
is_winner <- function(element){
  return(length(html_elements(element, ".match-competitor--loser")) < 1 )
}
# extract data from competitor sections
comp_details <- function(comp_card, prefix="_"){
  l = lst()
  l[paste(prefix, "n",    sep = "")] <- comp_card %>% html_element(".match-card__competitor-n") %>% html_text()
  l[paste(prefix, "name", sep = "")] <- comp_card %>% html_element(".match-card__competitor-name") %>% html_text()
  l[paste(prefix, "club", sep = "")] <- comp_card %>% html_element(".match-card__club-name") %>% html_text()
  l[paste(prefix, "dq",   sep = "")] <- comp_card %>% html_element(".match-card__disqualification") %>% html_attr("title")
  l[paste(prefix, "won",  sep = "")] <- comp_card %>% html_element(".match-competitor--loser") %>% length() == 0
  return(l) 
}


# scrape & process --------------------------------------------------------

html <- read_html('https://web.archive.org/web/20220913034642/https://www.bjjcompsystem.com/tournaments/1869/categories/2053162')

html %>% 
  # collect all match cards
  html_elements("div.tournament-category__match") %>% 
  keep(is_valid_match) %>% 
  # apply anonymous function to every item in the list of match cards
  map(function(match_card){
    match_id <- match_card %>% html_element(".tournament-category__match-card") %>% html_attr("id")
    where    <- match_card %>% html_element(".bracket-match-header__where") %>% html_text()
    when     <- match_card %>% html_element(".bracket-match-header__when")  %>% html_text()
    
    competitors <- html_nodes(match_card, ".match-card__competitor")
    # extract competitior data
    comp01 <- competitors[[1]] %>% comp_details(prefix = "comp01_")
    comp02 <- competitors[[2]] %>% comp_details(prefix = "comp02_")
    
    winner_idx <- competitors %>% detect_index(is_winner)
    # lst for creating a named list 
    l <- lst(match_id, where, when, winner_idx)
    # combine all items and comp lists into single list
    l <- c(l,comp01, comp02)
    return(l)
  }) %>% 
  # each resulting list item into single-row tibble
  map(as_tibble) %>% 
  # reduce list of tibbles into single tibble
  reduce(bind_rows)

Result:

#> # A tibble: 65 × 14
#>    match_id  where when  winne…¹ comp0…² comp0…³ comp0…⁴ comp0…⁵ comp0…⁶ comp0…⁷
#>    <chr>     <chr> <chr>   <int> <chr>   <chr>   <chr>   <chr>   <lgl>   <chr>  
#>  1 match-1-1 FIGH… Sat …       2 58      Christ… Rodrig… <NA>    FALSE   66     
#>  2 match-1-9 FIGH… Sat …       2 6       Melvin… GF Team Disqua… FALSE   66     
#>  3 match-1-… FIGH… Sat …       2 47      Eric R… Atos J… <NA>    FALSE   66     
#>  4 match-1-… FIGH… Sat …       1 47      Eric R… Atos J… <NA>    TRUE    10     
#>  5 match-1-… FIGH… Sat …       2 42      Ivan M… CheckM… <NA>    FALSE   66     
#>  6 match-1-… FIGH… Sat …       2 18      Joel S… Gracie… <NA>    FALSE   47     
#>  7 match-1-… FIGH… Sat …       1 42      Ivan M… CheckM… <NA>    TRUE    26     
#>  8 match-1-… FIGH… Sat …       2 34      Matthe… Super … <NA>    FALSE   18     
#>  9 match-2-9 FIGH… Sat …       1 62      Bryan … Team J… <NA>    TRUE    4      
#> 10 match-2-… FIGH… Sat …       2 22      Steffe… Six Bl… <NA>    FALSE   30     
#> # … with 55 more rows, 4 more variables: comp02_name <chr>, comp02_club <chr>,
#> #   comp02_dq <chr>, comp02_won <lgl>, and abbreviated variable names
#> #   ¹winner_idx, ²comp01_n, ³comp01_name, ⁴comp01_club, ⁵comp01_dq,
#> #   ⁶comp01_won, ⁷comp02_n

^{Created on 2022-09-19 with reprex v2.0.2}

Also note that not all matches have a winner and both participants can be disqualified (screenshot), so splitting them to winners & opponents might not be optimal.