I am trying to scrape basic player information for cricket players from their profiles on the cricinfo website. An example of a player profile page is given here: https://www.espncricinfo.com/player/shaun-marsh-6683
Ultimately, I would like to write a function in R to extract the information at the top of the overview tab (Full Name, Born, Age etc), and would like to put the information into a dataframe in R. I then have another function which will allow me to do this for multiple players of interest.
However, there are 2 main issues: the first is that not all players have the same information categories on their overview pages. Therefore, I need to import the category headings (eg. full name, born, age etc) as well as their corresponding values for each player. I have done this using rvest in R with the following code:
player_info <- content %>%
  html_nodes(".player_overview-grid") %>%
  html_nodes(".player-card-description.gray-900") %>%
  html_text()
player_cats <- content %>%
  html_nodes(".player_overview-grid") %>%
  html_nodes(".player-card-heading") %>%
  html_text()
newplayer <- data.frame(player_cats, player_info)
This gives the desired result for most players, but it runs into an issue that I cannot figure out how to solve. Some players have two values under a given heading; for example, the player in the link above has two relations (a brother and a father), which means the 'player_cats' and 'player_info' vectors have different lengths.
Please could someone help me with a way to solve this issue. I think I somehow need to extract the categories and their values as pairs, rather than separately, if that makes sense. I would be happy just to extract the first value in a category if there are multiple entries, or alternatively to include the category heading multiple times in the final data frame in R. Either is ok.
Excuse me if this is a simple issue, I am very new to this. Many thanks
EDIT: If I apply the function to this player's page, https://www.espncricinfo.com/player/wes-agar-959833, the output is as desired, since each category has only one entry. That is, it gives me a dataframe of the information categories and their values for this player (image 1).
However, the issue arises when I apply the function to the original profile: https://www.espncricinfo.com/player/shaun-marsh-6683. I get an error, since there are 9 categories but 10 entries, and thus I cannot use rbind (images 2, 3 and 4). I need a way to scrape which category each value belongs to, so that I can repeat the category header in the dataframe in R. I would hope to see either a dataframe with 10 rows, with 'relations' repeated in the first column, or a dataframe with 9 rows, with 'relations' once and the first value "GR Marsh" in the right-hand column.
Answer:
One way to solve this is to use html_text2() with an XPath for each category:
library(rvest)
library(dplyr)
library(stringr)  # provides str_split()
url = "https://www.espncricinfo.com/player/shaun-marsh-6683"
# read the page once, rather than on every loop iteration
page = read_html(url)
# create an empty object to accumulate the results
df = vector()
for(i in 1:9){
  # build the XPath for each of the nine categories
  nod = paste0('//*[@id="main-container"]/div[1]/div/div[2]/div/div[2]/div[2]/div[1]/div/div[1]/div[', i, ']')
  df1 = page %>%
    html_nodes(xpath = nod) %>%
    html_text2()
  # split the text on newlines so the heading and its value(s) become columns
  df1 = do.call(rbind, str_split(df1, "\n"))
  df = rbind.data.frame(df, df1)
}
V1 V2 V3
1 Full Name Shaun Edward Marsh
2 Born July 09, 1983, Narrogin, Western Australia
3 Age 38y 147d
4 Nicknames Sos
5 Batting Style Left hand bat
6 Bowling Style Slow left arm orthodox
7 Playing Role Top order batter
8 Height 1.84 m
9 relations GR Marsh (father),MR Marsh (brother)
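If the hard-coded XPath and the fixed 1:9 loop turn out to be brittle (for instance, when a player has fewer categories), another option may be to walk from each heading to its parent container and collect whatever values sit under it, so that each value stays paired with its heading. The following is only a sketch under assumptions: the inline HTML is a made-up stand-in for the overview grid (the real markup may be nested differently), and scrape_pairs() and its first_only argument are hypothetical names, not part of rvest.

```r
library(rvest)

# Hypothetical stand-in for the overview grid; the live page's markup may differ.
demo_html <- '
<div class="player_overview-grid">
  <div>
    <p class="player-card-heading">Full Name</p>
    <div><span class="player-card-description gray-900">Shaun Edward Marsh</span></div>
  </div>
  <div>
    <p class="player-card-heading">Relations</p>
    <div><span class="player-card-description gray-900">GR Marsh (father)</span></div>
    <div><span class="player-card-description gray-900">MR Marsh (brother)</span></div>
  </div>
</div>'
demo <- read_html(demo_html)

# Hypothetical helper: one row per (heading, value) pair.
# Assumes each heading and its value(s) share the same parent element.
scrape_pairs <- function(page, first_only = FALSE) {
  headings <- page %>%
    html_nodes(".player_overview-grid .player-card-heading")
  rows <- lapply(headings, function(h) {
    # look for values only inside this heading's own container
    values <- xml2::xml_parent(h) %>%
      html_nodes(".player-card-description") %>%
      html_text2()
    if (first_only) values <- head(values, 1)
    # repeat the heading once per value so the columns always match in length
    data.frame(category = rep(html_text2(h), length(values)),
               value = values,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

all_rows   <- scrape_pairs(demo)                     # heading repeated per value
first_rows <- scrape_pairs(demo, first_only = TRUE)  # first value per heading only
print(all_rows)
```

For a live page, the idea would be to pass read_html(url) instead of demo, after checking in the browser inspector that each heading and its values really do share a parent; if they do not, the xml_parent() step would need adjusting.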