Web scrape uneven non table content - problem when multiple values for a heading

Time:12-07

I am trying to scrape basic player information for cricket players from their profiles on the cricinfo website. An example of a player profile page is given here: https://www.espncricinfo.com/player/shaun-marsh-6683

Ultimately, I would like to write a function in R to extract the information at the top of the overview tab (Full Name, Born, Age etc), and would like to put the information into a dataframe in R. I then have another function which will allow me to do this for multiple players of interest.

However, there are two main issues. The first is that not all players have the same information categories on their overview pages, so I need to import the category headings (e.g. Full Name, Born, Age) as well as their corresponding values for each player. I have done this using rvest in R with the following code:

player_info <- content %>%
  html_nodes(".player_overview-grid") %>%
  html_nodes(".player-card-description.gray-900") %>%
  html_text()

player_cats <- content %>%
  html_nodes(".player_overview-grid") %>%
  html_nodes(".player-card-heading") %>%
  html_text()

newplayer <- data.frame(player_cats, player_info)

This gives the desired result for most players, but it runs into the second issue, which I cannot figure out how to solve. Some players have two values under a given heading. For example, the player in the link above has two relations (a brother and a father), so the player_cats and player_info vectors have different lengths.

Please could someone help me with a way to solve this issue. I think I somehow need to extract the categories and their values as pairs, rather than separately, if that makes sense. I would be happy just to extract the first value in a category if there are multiple entries, or alternatively to include the category heading multiple times in the final data frame in R. Either is ok.

Excuse me if this is a simple issue, I am very new to this. Many thanks

EDIT:

Let's say I apply the function to this player's page: https://www.espncricinfo.com/player/wes-agar-959833. The output is as desired, since each category has only one entry. That is, it gives me the dataframe seen in image 1 below: the information categories and their values for this player.

However, the issue arises when I apply the function to the original profile: https://www.espncricinfo.com/player/shaun-marsh-6683. I get an error, since there are 9 categories but 10 entries, and thus I cannot use rbind (see pics 2, 3 and 4). I need a way to scrape which category each value belongs to, so that I can repeat the category header in the dataframe in R. I would hope to see either a dataframe with 10 rows, with 'relations' repeated in the first column, or a dataframe with 9 rows, with 'relations' once and the first value "GR Marsh" in the right-hand column.

CodePudding user response:

One way to solve this is to use html_text2() with an XPath for each category:

library(rvest)
library(dplyr)
library(stringr)  # provides str_split()

url = "https://www.espncricinfo.com/player/shaun-marsh-6683"

# read the page once rather than on every loop iteration
page = read_html(url)

# create an empty object to accumulate the results
df = vector()

for(i in 1:9){
  # build the xpath for each of the nine categories
  nod = paste0('//*[@id="main-container"]/div[1]/div/div[2]/div/div[2]/div[2]/div[1]/div/div[1]/div[', i, ']')
  df1 = page %>%
    html_nodes(xpath = nod) %>%
    html_text2()
  # split the text of this category into columns (heading, blank, value)
  df1 = do.call(rbind, str_split(df1, "\n"))
  df = rbind.data.frame(df, df1)
}

                 V1 V2                                         V3
    1     Full Name                            Shaun Edward Marsh
    2          Born    July 09, 1983, Narrogin, Western Australia
    3           Age                                      38y 147d
    4     Nicknames                                           Sos
    5 Batting Style                                 Left hand bat
    6 Bowling Style                        Slow left arm orthodox
    7  Playing Role                              Top order batter
    8        Height                                        1.84 m
    9     relations          GR Marsh (father),MR Marsh (brother)
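
If you would rather have 'relations' repeated once per value (the 10-row result asked for in the question), another option is to loop over each category container and pull out its heading and its values together, so they stay paired. The following is only a minimal sketch: the HTML snippet is made up for illustration and simply reuses the class names from the question, and the selector ".player_overview-grid > div" assumes that each category sits in its own child div of the grid, which may not match the live page.

library(rvest)

# Hypothetical markup that reuses the class names from the question;
# the structure of the real page may differ.
page <- read_html('
  <div class="player_overview-grid">
    <div>
      <p class="player-card-heading">Full Name</p>
      <span class="player-card-description gray-900">Shaun Edward Marsh</span>
    </div>
    <div>
      <p class="player-card-heading">Relations</p>
      <span class="player-card-description gray-900">GR Marsh (father)</span>
      <span class="player-card-description gray-900">MR Marsh (brother)</span>
    </div>
  </div>')

# one node per category (assumed: direct children of the grid)
cells <- page %>%
  html_nodes(".player_overview-grid > div")

# for each category, pair its heading with every value it contains
rows <- lapply(cells, function(cell) {
  heading <- cell %>% html_node(".player-card-heading") %>% html_text2()
  values  <- cell %>% html_nodes(".player-card-description") %>% html_text2()
  # the single heading is recycled across all of its values
  data.frame(category = heading, value = values, stringsAsFactors = FALSE)
})

newplayer <- do.call(rbind, rows)

With the snippet above, newplayer has three rows, with "Relations" appearing twice. To keep only the first value per category instead, values[1] could be used in place of values. A category with no values at all would need separate handling, since data.frame() cannot combine a heading with an empty vector.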