Grab specific element in HTML using Python : BeautifulSoup4

Time:04-30

I am using BeautifulSoup to scrape website data. I am getting a handle on how to scrape things that are displayed on the webpage. However, there is a unique identifier embedded in the HTML that I want to grab, and it doesn't have a title I can match on. For example:

<tbody><tr ><th scope="row"  data-stat="ranker" csk="1" >1</th><td  data-stat="pos" csk="1" ><strong>C</strong></td><td  data-append-csv="mccanja02" data-stat="player" csk="McCann,James" ><strong><a href="/players/m/mccanja02.shtml">James McCann</a></strong></td><td  data-stat="age" >32</td><td  data-stat="G" >13</td><td  data-stat="PA" >42</td><td  data-stat="AB" >36</td><td  data-stat="R" >5</td><td  data-stat="H" >7</td><td  data-stat="2B" >2</td><td  data-stat="3B" >0</td><td  data-stat="HR" >1</td><td  data-stat="RBI" >5</td><td  data-stat="SB" >1</td><td  data-stat="CS" >0</td><td  data-stat="BB" >2</td><td  data-stat="SO" >7</td><td  data-stat="batting_avg" >.194</td><td  data-stat="onbase_perc" >.286</td><td  data-stat="slugging_perc" >.333</td><td  data-stat="onbase_plus_slugging" >.619</td><td  data-stat="onbase_plus_slugging_plus" >87</td><td  data-stat="TB" >12</td><td  data-stat="GIDP" >1</td><td  data-stat="HBP" >3</td><td  data-stat="SH" >0</td><td  data-stat="SF" >1</td><td  data-stat="IBB" >0</td></tr>

I want to grab just "mccanja02" because it can be appended to a URL that leads to the player's specific page. So far I've tried something like this:

# grab players UID
rowsUID = tableTeamBatting.find_all('tr')
for rowUID in rowsUID:
    playerUID = rowUID.find('td', {'data-append-csv'})
    if playerUID:
        playerUID = playerUID.text
        print(playerUID)

But there is no title to connect it with. If I wanted to grab the player's name, for example, I could just do:

# grab players name
rows = tableTeamBatting.find_all('tr')
for row in rows:
    players = []
    player = row.find('td', {'data-stat' : 'player'})
    if player:
        player = player.text
        print(player)

I couldn't get @F.Hoque's solution to output exactly what I needed, so I made this monstrosity:

# grab players UID
rowsUID = tableTeamBatting.find_all('tr')
for rowUID in rowsUID:
    playerUID = rowUID.select('a[href]')
    playerUID = playerUID if playerUID else None
    if playerUID == None:
        continue
    else:
        pUID = str(playerUID)
        pUID = pUID.split('/')
        for p in range(len(pUID)):
            if '.shtml' in pUID[p]:
                stor = pUID[p].split('.shtml')
                print(stor[0])

This gives me the pUID that I am looking for. I could not use the code in the comment because it returned the entire cell:

<td  csk="McCann,James" data-append-csv="mccanja02" data-stat="player"><strong><a href="/players/m/mccanja02.shtml">James McCann</a></strong></td>
<td  csk="Alonso,Pete" data-append-csv="alonspe01" data-stat="player"><strong><a href="/players/a/alonspe01.shtml">Pete Alonso</a></strong></td>
<td  csk="McNeil,Jeff" data-append-csv="mcneije01" data-stat="player"><strong><a href="/players/m/mcneije01.shtml">Jeff McNeil</a>*</strong></td>
<td  csk="Lindor,Francisco" data-append-csv="lindofr01" data-stat="player"><strong><a href="/players/l/lindofr01.shtml">Francisco Lindor</a>#</strong></td>...

And I was only looking for the value of data-append-csv (the pUID). I appreciate the help though. I dug into some of the docs and was able to locate some stuff. I'm open to any suggestions on how to improve this.
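Since suggestions are welcome, here is one possible simplification, offered as a sketch rather than a definitive fix. Instead of converting the matched links to a string and splitting it, it reads each link's href directly and keeps the file name without its extension. The two-row sample and the name player_ids are made up for illustration, and 'html.parser' is used only so the snippet runs without the lxml package.

```python
from pathlib import PurePosixPath

from bs4 import BeautifulSoup

# Trimmed-down sample modelled on the rows shown above (illustrative only)
html = '''
<table><tbody>
<tr><th data-stat="ranker">1</th>
<td data-append-csv="mccanja02" data-stat="player"><strong><a href="/players/m/mccanja02.shtml">James McCann</a></strong></td></tr>
<tr><th data-stat="ranker">2</th>
<td data-append-csv="alonspe01" data-stat="player"><strong><a href="/players/a/alonspe01.shtml">Pete Alonso</a></strong></td></tr>
</tbody></table>
'''

soup = BeautifulSoup(html, 'html.parser')

player_ids = []
# Select only the links inside the player cells
for link in soup.select('td[data-stat="player"] a[href]'):
    # '/players/m/mccanja02.shtml' -> 'mccanja02'
    player_ids.append(PurePosixPath(link['href']).stem)

print(player_ids)
```

This depends on the ID always matching the file name in the link, which holds for the rows shown here but is an assumption about the rest of the site.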

CodePudding user response:

mccanja02 is the value of the data-append-csv attribute, not text inside the tag, so you can't call .text to grab it. You can grab it using a CSS selector as follows:

html='''
<html>
 <body>
  <tbody>
   <tr>
    <th  csk="1" data-stat="ranker" scope="row">
     1
    </th>
    <td  csk="1" data-stat="pos">
     <strong>
      C
     </strong>
    </td>
    <td  csk="McCann,James" data-append-csv="mccanja02" data-stat="player">       
     <strong>
      <a href="/players/m/mccanja02.shtml">
       James McCann
      </a>
     </strong>
    </td>
    <td  data-stat="age">
     32
    </td>
    <td  data-stat="G">
     13
    </td>
    <td  data-stat="PA">
     42
    </td>
    <td  data-stat="AB">
     36
    </td>
    <td  data-stat="R">
     5
    </td>
    <td  data-stat="H">
     7
    </td>
    <td  data-stat="2B">
     2
    </td>
    <td  data-stat="3B">
     0
    </td>
    <td  data-stat="HR">
     1
    </td>
    <td  data-stat="RBI">
     5
    </td>
    <td  data-stat="SB">
     1
    </td>
    <td  data-stat="CS">
     0
    </td>
    <td  data-stat="BB">
     2
    </td>
    <td  data-stat="SO">
     7
    </td>
    <td  data-stat="batting_avg">
     .194
    </td>
    <td  data-stat="onbase_perc">
     .286
    </td>
    <td  data-stat="slugging_perc">
     .333
    </td>
    <td  data-stat="onbase_plus_slugging">
     .619
    </td>
    <td  data-stat="onbase_plus_slugging_plus">
     87
    </td>
    <td  data-stat="TB">
     12
    </td>
    <td  data-stat="GIDP">
     1
    </td>
    <td  data-stat="HBP">
     3
    </td>
    <td  data-stat="SH">
     0
    </td>
    <td  data-stat="SF">
     1
    </td>
    <td  data-stat="IBB">
     0
    </td>
   </tr>
  </tbody>
 </body>
</html>
'''

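from bs4 import BeautifulSoup

# Parse the sample HTML (the 'lxml' parser requires the lxml package)
tableTeamBatting = BeautifulSoup(html, 'lxml')
# print(tableTeamBatting.prettify())

rowsUID = tableTeamBatting.select('tr')
for rowUID in rowsUID:
    # Match the <td> that carries a data-append-csv attribute, whatever its value
    playerUID = rowUID.select_one('td[data-append-csv]')
    # Read the attribute value; rows without such a cell give None
    playerUID = playerUID.get('data-append-csv') if playerUID else None

    print(playerUID)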

Output:

mccanja02
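
If the goal is to turn these IDs into player page URLs, a variation along the following lines might help. It is only a sketch: BASE_URL is a placeholder (the real site address is not given in the question), the sample rows are trimmed for illustration, and the path pattern /players/<first letter>/<id>.shtml is inferred from the href values shown above. It uses find_all with attrs={'data-append-csv': True}, which matches any td that has the attribute regardless of its value, so header rows are skipped without a None check.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://example.com'  # placeholder, not the real site

# Trimmed-down sample; the first row has no player cell
html = '''
<table><tbody>
<tr><th>Rk</th><th>Name</th></tr>
<tr><th data-stat="ranker">1</th>
<td data-append-csv="mccanja02" data-stat="player"><a href="/players/m/mccanja02.shtml">James McCann</a></td></tr>
<tr><th data-stat="ranker">2</th>
<td data-append-csv="alonspe01" data-stat="player"><a href="/players/a/alonspe01.shtml">Pete Alonso</a></td></tr>
</tbody></table>
'''

soup = BeautifulSoup(html, 'html.parser')

# True means "the attribute exists", whatever its value
cells = soup.find_all('td', attrs={'data-append-csv': True})
player_ids = [cell['data-append-csv'] for cell in cells]

# Rebuild the page path from each ID, following the pattern seen in the hrefs
player_urls = [
    urljoin(BASE_URL, f'/players/{pid[0]}/{pid}.shtml') for pid in player_ids
]

for url in player_urls:
    print(url)
```

Reading the href from the link inside each cell would avoid guessing the path pattern, at the cost of depending on the link always being present.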