I am using BeautifulSoup to scrape website data. I am getting a handle on how to scrape things that are displayed on the webpage; however, there is a unique identifier embedded in the HTML that I want to grab, and it doesn't have a title. For example:
<tbody><tr ><th scope="row" data-stat="ranker" csk="1" >1</th><td data-stat="pos" csk="1" ><strong>C</strong></td><td data-append-csv="mccanja02" data-stat="player" csk="McCann,James" ><strong><a href="/players/m/mccanja02.shtml">James McCann</a></strong></td><td data-stat="age" >32</td><td data-stat="G" >13</td><td data-stat="PA" >42</td><td data-stat="AB" >36</td><td data-stat="R" >5</td><td data-stat="H" >7</td><td data-stat="2B" >2</td><td data-stat="3B" >0</td><td data-stat="HR" >1</td><td data-stat="RBI" >5</td><td data-stat="SB" >1</td><td data-stat="CS" >0</td><td data-stat="BB" >2</td><td data-stat="SO" >7</td><td data-stat="batting_avg" >.194</td><td data-stat="onbase_perc" >.286</td><td data-stat="slugging_perc" >.333</td><td data-stat="onbase_plus_slugging" >.619</td><td data-stat="onbase_plus_slugging_plus" >87</td><td data-stat="TB" >12</td><td data-stat="GIDP" >1</td><td data-stat="HBP" >3</td><td data-stat="SH" >0</td><td data-stat="SF" >1</td><td data-stat="IBB" >0</td></tr>
I want to grab just "mccanja02" because it can be appended to a URL to reach the player's specific page. So far I've tried something like this:
# grab player's UID
rowsUID = tableTeamBatting.find_all('tr')
for rowUID in rowsUID:
    playerUID = rowUID.find('td', {'data-append-csv'})
    if playerUID:
        playerUID = playerUID.text
        print(playerUID)
But there is no title to connect it with. By contrast, if I wanted to grab the player's name, I could just do:
# grab player's name
rows = tableTeamBatting.find_all('tr')
for row in rows:
    players = []
    player = row.find('td', {'data-stat' : 'player'})
    if player:
        player = player.text
        print(player)
I couldn't get @F.Hoque's solution to output exactly what I wanted, so I made this monstrosity:
# grab player's UID
rowsUID = tableTeamBatting.find_all('tr')
for rowUID in rowsUID:
    playerUID = rowUID.select('a[href]')
    playerUID = playerUID if playerUID else None
    if playerUID == None:
        continue
    else:
        pUID = str(playerUID)
        pUID = pUID.split('/')
        for p in range(len(pUID)):
            if '.shtml' in pUID[p]:
                stor = pUID[p].split('.shtml')
                print(stor[0])
This gives me the pUID that I am looking for. I could not use the code in the comment because it would return this:
<td csk="McCann,James" data-append-csv="mccanja02" data-stat="player"><strong><a href="/players/m/mccanja02.shtml">James McCann</a></strong></td>
<td csk="Alonso,Pete" data-append-csv="alonspe01" data-stat="player"><strong><a href="/players/a/alonspe01.shtml">Pete Alonso</a></strong></td>
<td csk="McNeil,Jeff" data-append-csv="mcneije01" data-stat="player"><strong><a href="/players/m/mcneije01.shtml">Jeff McNeil</a>*</strong></td>
<td csk="Lindor,Francisco" data-append-csv="lindofr01" data-stat="player"><strong><a href="/players/l/lindofr01.shtml">Francisco Lindor</a>#</strong></td>...
And I was only looking for that data-append-csv=pUID. I appreciate the help though, I dug into some of the docs and was able to locate some stuff. I'm open to any suggestions on how to improve this.
CodePudding user response:
mccanja02 is the value of the data-append-csv attribute, not part of the element's text, so you can't call .text to grab it. You can select the cell with a CSS attribute selector and read the attribute value as follows:
html='''
<html>
<body>
<tbody>
<tr>
<th csk="1" data-stat="ranker" scope="row">
1
</th>
<td csk="1" data-stat="pos">
<strong>
C
</strong>
</td>
<td csk="McCann,James" data-append-csv="mccanja02" data-stat="player">
<strong>
<a href="/players/m/mccanja02.shtml">
James McCann
</a>
</strong>
</td>
<td data-stat="age">
32
</td>
<td data-stat="G">
13
</td>
<td data-stat="PA">
42
</td>
<td data-stat="AB">
36
</td>
<td data-stat="R">
5
</td>
<td data-stat="H">
7
</td>
<td data-stat="2B">
2
</td>
<td data-stat="3B">
0
</td>
<td data-stat="HR">
1
</td>
<td data-stat="RBI">
5
</td>
<td data-stat="SB">
1
</td>
<td data-stat="CS">
0
</td>
<td data-stat="BB">
2
</td>
<td data-stat="SO">
7
</td>
<td data-stat="batting_avg">
.194
</td>
<td data-stat="onbase_perc">
.286
</td>
<td data-stat="slugging_perc">
.333
</td>
<td data-stat="onbase_plus_slugging">
.619
</td>
<td data-stat="onbase_plus_slugging_plus">
87
</td>
<td data-stat="TB">
12
</td>
<td data-stat="GIDP">
1
</td>
<td data-stat="HBP">
3
</td>
<td data-stat="SH">
0
</td>
<td data-stat="SF">
1
</td>
<td data-stat="IBB">
0
</td>
</tr>
</tbody>
</body>
</html>
'''
from bs4 import BeautifulSoup

tableTeamBatting = BeautifulSoup(html, 'lxml')
# print(tableTeamBatting.prettify())

# Select the cell that carries the attribute, then read the attribute value.
rowsUID = tableTeamBatting.select('tr')
for rowUID in rowsUID:
    playerUID = rowUID.select_one('td[data-append-csv]')
    playerUID = playerUID.get('data-append-csv') if playerUID else None
    print(playerUID)
Output:
mccanja02
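If the table has several rows, the same attribute selector can collect every ID in one pass and turn each one into a page URL. The following is only a rough sketch under a few assumptions: the sample rows are a trimmed-down copy of the markup in the question, BASE_URL is a placeholder rather than the real site, the names sample_html, player_uids and player_urls are made up for illustration, and html.parser is swapped in for lxml so that nothing beyond bs4 needs to be installed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Trimmed-down sample rows in the same shape as the table in the question.
sample_html = """
<table><tbody>
<tr><th data-stat="ranker">1</th>
<td data-append-csv="mccanja02" data-stat="player"><a href="/players/m/mccanja02.shtml">James McCann</a></td></tr>
<tr><th data-stat="ranker">2</th>
<td data-append-csv="alonspe01" data-stat="player"><a href="/players/a/alonspe01.shtml">Pete Alonso</a></td></tr>
<tr><th data-stat="ranker"></th><td data-stat="player">Team Totals</td></tr>
</tbody></table>
"""

# Assumption: placeholder base URL; substitute the site being scraped.
BASE_URL = "https://example.com"

soup = BeautifulSoup(sample_html, "html.parser")

# td[data-append-csv] matches only cells that carry the attribute,
# so rows without a player ID (such as the totals row) are skipped.
cells = soup.select("td[data-append-csv]")
player_uids = [cell["data-append-csv"] for cell in cells]

# Map each ID to a full URL built from the link inside the same cell.
player_urls = {
    cell["data-append-csv"]: urljoin(BASE_URL, cell.a["href"])
    for cell in cells
    if cell.a is not None
}

print(player_uids)
print(player_urls["mccanja02"])
```

Selecting the cells directly, rather than looping over every tr, avoids the None check for rows that have no player ID. Using the href already present in the cell also means the URL pattern does not have to be reconstructed by hand.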