How to scrape a table from a website when its class isn't a table

Time:11-16

I want to scrape the player data table from the following URL:

https://www.transfermarkt.de/mamadou-doucoure/profil/spieler/340480

Here's what I coded:

x <- read_html(url) %>%
        html_node(xpath = '//div[@]') %>%
        html_table(fill = TRUE) %>% 
        as.data.frame() %>%
        set_names(.,letters[1:ncol(.)])

As far as I understand, the player data isn't classed as a table, and I don't know how to edit the code. Also, I want to have the output in a data frame.

Answer:

A data frame can take many forms, and keeping the player table as-is may not be the most practical one, but here are a few examples. Some parts are a bit tricky, and handling them correctly depends on context and objective (e.g. multiple nationalities, which currently end up as a single collapsed value).

library(rvest)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(stringr)

url <- "https://www.transfermarkt.de/mamadou-doucoure/profil/spieler/340480"
html <- read_html(url)

# most basic approach: extract just what's in the table, plus the player name:
df_01 <- tibble(
  feature = html_elements(html, "div.info-table > span.info-table__content--regular") %>% html_text() %>% str_squish(),
  text = html_elements(html, "div.info-table > span.info-table__content--bold") %>% html_text() %>% str_squish()
) %>%
  # player name is not included in div.info-table, add it separately
  add_row(.before = 1,
              feature = "Player:",
              text = html_elements(html, "header > div.data-header__headline-container > h1") %>% html_text() %>% str_squish())

df_01
#> # A tibble: 15 × 2
#>    feature              text                                   
#>    <chr>                <chr>                                  
#>  1 Player:              "#4 Mamadou Doucouré"                  
#>  2 Geburtsdatum:        "21.05.1998"                           
#>  3 Geburtsort:          "Dakar"                                
#>  4 Alter:               "24"                                   
#>  5 Größe:               "1,83 m"                               
#>  6 Nationalität:        "Frankreich Senegal"                   
#>  7 Position:            "Abwehr - Innenverteidiger"            
#>  8 Fuß:                 "links"                                
#>  9 Spielerberater:      "Sport Avenir Management International"
#> 10 Aktueller Verein:    "Borussia Mönchengladbach"             
#> 11 Im Team seit:        "01.07.2016"                           
#> 12 Vertrag bis:         "30.06.2024"                           
#> 13 Letzte Verlängerung: "14.02.2020"                           
#> 14 2. Verein:           "Borussia Mönchengladbach II (#3)"     
#> 15 Social Media:        ""
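
The pairing of label and value spans can also be tried on a small hand-written fragment, which shows why html_table() has nothing to work with here. The HTML below is a made-up stand-in that only borrows the class names used above; the real page contains more markup, so treat this as a sketch rather than a copy of the site:

```r
library(rvest)

# Hypothetical fragment mimicking the div/span layout; not copied from the site.
snippet <- minimal_html('
  <div class="info-table">
    <span class="info-table__content info-table__content--regular">Geburtsort:</span>
    <span class="info-table__content info-table__content--bold"> Dakar </span>
    <span class="info-table__content info-table__content--regular">Position:</span>
    <span class="info-table__content info-table__content--bold">Abwehr</span>
  </div>')

# There is no <table> element, so html_table() would have nothing to parse.
n_tables <- length(html_elements(snippet, "table"))

# Labels and values are sibling spans: select each set and pair them by position.
labels <- html_elements(snippet, "div.info-table > span.info-table__content--regular") %>%
  html_text(trim = TRUE)
values <- html_elements(snippet, "div.info-table > span.info-table__content--bold") %>%
  html_text(trim = TRUE)

info <- data.frame(feature = labels, text = values)
```

Pairing by position only works while every label has exactly one value span, which is the assumption df_01 relies on as well.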

To include URLs, we handle the first info-table column as before but process the second one with map. Not all entries have URLs, and we don't want to end up with misaligned columns of different lengths:

df_02 <- tibble(
  feature = html_elements(html, "div.info-table > span.info-table__content--regular") %>% html_text() %>% str_squish(),
) %>% bind_cols(
  purrr::map_df(
    html_elements(html, "div.info-table > span.info-table__content--bold"), 
    ~ list(
      html_text(.x) %>% stringr::str_squish() %>% na_if(""),
      html_element(.x, "a") %>% html_attr("href") 
    ) %>% setNames(c("text", "url"))
  )
) %>% add_row(.before = 1,
            feature = "Player:",
            text = html_elements(html, "header > div.data-header__headline-container > h1") %>% html_text() %>% stringr::str_squish())

df_02
#> # A tibble: 15 × 3
#>    feature              text                                  url               
#>    <chr>                <chr>                                 <chr>             
#>  1 Player:              #4 Mamadou Doucouré                   <NA>              
#>  2 Geburtsdatum:        21.05.1998                            /aktuell/waspassi…
#>  3 Geburtsort:          Dakar                                 <NA>              
#>  4 Alter:               24                                    <NA>              
#>  5 Größe:               1,83 m                                <NA>              
#>  6 Nationalität:        Frankreich Senegal                    <NA>              
#>  7 Position:            Abwehr - Innenverteidiger             <NA>              
#>  8 Fuß:                 links                                 <NA>              
#>  9 Spielerberater:      Sport Avenir Management International /sport-avenir-man…
#> 10 Aktueller Verein:    Borussia Mönchengladbach              /borussia-monchen…
#> 11 Im Team seit:        01.07.2016                            <NA>              
#> 12 Vertrag bis:         30.06.2024                            <NA>              
#> 13 Letzte Verlängerung: 14.02.2020                            <NA>              
#> 14 2. Verein:           Borussia Mönchengladbach II (#3)      /borussia-monchen…
#> 15 Social Media:        <NA>                                  http://www.instag…
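
The alignment concern can be illustrated on a tiny invented fragment (the class name is borrowed from above; the link target and cell contents are placeholders). Selecting all links at once returns only the links that exist, while html_element() (singular) applied to the set of cells returns one result per cell, with NA where no link is present:

```r
library(rvest)

# Hypothetical value cells: only the second one contains a link.
snippet <- minimal_html('
  <div class="info-table">
    <span class="info-table__content--bold">Dakar</span>
    <span class="info-table__content--bold"><a href="/some-club/startseite">Some Club</a></span>
    <span class="info-table__content--bold">links</span>
  </div>')

cells <- html_elements(snippet, "span.info-table__content--bold")

# Selecting every <a> at once drops the cells without a link,
# so the result is shorter than the number of cells.
hrefs_flat <- html_elements(snippet, "span.info-table__content--bold a") %>%
  html_attr("href")

# html_element() keeps one entry per cell and yields NA for missing links,
# so the result lines up with the text column.
hrefs_aligned <- html_element(cells, "a") %>%
  html_attr("href")
```

Here hrefs_flat has fewer entries than there are cells, whereas hrefs_aligned has exactly one entry per cell and can be bound next to the text column safely.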

To get a tidy data frame that could potentially hold more players, missing text values are replaced by URLs and the separate URL column is dropped:

df_03 <- df_02 %>% 
  mutate(feature = janitor::make_clean_names(feature),
         text = coalesce(text, url)) %>% 
  select(-url) %>% 
  pivot_wider(names_from = feature, values_from = text) %>% 
  # split the "#<number> <name>" headline into shirt number and player name
  tidyr::extract(player, into = c("number", "player"), regex = "^#(\\d+) (.*)")

glimpse(df_03)
#> Rows: 1
#> Columns: 16
#> $ number              <chr> "4"
#> $ player              <chr> "Mamadou Doucouré"
#> $ geburtsdatum        <chr> "21.05.1998"
#> $ geburtsort          <chr> "Dakar"
#> $ alter               <chr> "24"
#> $ grosse              <chr> "1,83 m"
#> $ nationalitat        <chr> "Frankreich Senegal"
#> $ position            <chr> "Abwehr - Innenverteidiger"
#> $ fuss                <chr> "links"
#> $ spielerberater      <chr> "Sport Avenir Management International"
#> $ aktueller_verein    <chr> "Borussia Mönchengladbach"
#> $ im_team_seit        <chr> "01.07.2016"
#> $ vertrag_bis         <chr> "30.06.2024"
#> $ letzte_verlangerung <chr> "14.02.2020"
#> $ x2_verein           <chr> "Borussia Mönchengladbach II (#3)"
#> $ social_media        <chr> "http://www.instagram.com/mams_dcr/"
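
As a rough sketch of how the one-row-per-player shape extends to several players: each player's long table is widened separately and the rows are then stacked. Both long tables below are invented placeholders (names and values are not scraped), and they assume the feature names have already been cleaned:

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Hypothetical long-format results for two players (made-up values).
# The second player lacks the "fuss" entry on purpose.
long_a <- tibble(feature = c("player", "geburtsort", "fuss"),
                 text    = c("#4 Player A", "Dakar", "links"))
long_b <- tibble(feature = c("player", "geburtsort"),
                 text    = c("#10 Player B", "Berlin"))

# Widen each player to a single row, then stack the rows;
# bind_rows() fills features a player does not have with NA.
players <- list(long_a, long_b) %>%
  lapply(function(d) pivot_wider(d, names_from = feature, values_from = text)) %>%
  bind_rows() %>%
  tidyr::extract(player, into = c("number", "player"), regex = "^#(\\d+) (.*)")
```

Widening per player before binding avoids duplicate feature names across players, which would otherwise make pivot_wider() produce list-columns.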