How to select <tr> tags inside of a div with a specific CSS attribute via BeautifulSoup?

Time:07-24

I want to scrape several columns of text contained in td tags that share a common attribute. The td tags sit inside tr tags with a common attribute, inside a table with a specific class, inside a div.

For example, this is exactly how the website is structured.

<div>
   <table class="stats_table">
      <tbody>
         <tr data-row="0">
            <td data-stat="games">38</td>
            <td data-stat="wins">29</td>
            <td data-stat="draws">6</td>
            <td data-stat="losses">3</td>
            <td data-stat="points">93</td>
         </tr>
         <tr data-row="1">
            <td data-stat="games">38</td>
            <td data-stat="wins">28</td>
            <td data-stat="draws">8</td>
            <td data-stat="losses">2</td>
            <td data-stat="points">92</td>
         </tr>
         .
         .
         .
         <tr data-row="19">
            <td data-stat="games">38</td>
           <td data-stat="wins">5</td>
           <td data-stat="draws">7</td>
           <td data-stat="losses">26</td>
           <td data-stat="points">22</td>
         </tr>
       </tbody>
   </table>
</div>

I want to get the text enclosed in the td tags.

I tried to solve this problem with the code below:

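import requests
from bs4 import BeautifulSoup

# url is the address of the page containing the table
response = requests.get(url)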
soup = BeautifulSoup(response.text, "html.parser")
data = soup.select(".stats_table")
all_data = [l.get_text(strip=True) for l in soup.select(".stats_table:has(> [data-row])")]
print(all_data)

But when I try to execute this code, I get an empty list. I need your help on this matter, thanks.

CodePudding user response:

To get the text inside the td tags, you can use the stripped_strings generator, which yields every string inside a tag with the surrounding whitespace removed.

from bs4 import BeautifulSoup

html = '''
<div>
   <table class="stats_table">
      <tbody>
         <tr data-row="0">
            <td data-stat="games">38</td>
            <td data-stat="wins">29</td>
            <td data-stat="draws">6</td>
            <td data-stat="losses">3</td>
            <td data-stat="points">93</td>
         </tr>
         <tr data-row="1">
            <td data-stat="games">38</td>
            <td data-stat="wins">28</td>
            <td data-stat="draws">8</td>
            <td data-stat="losses">2</td>
            <td data-stat="points">92</td>
         </tr>
        
         <tr data-row="19">
            <td data-stat="games">38</td>
           <td data-stat="wins">5</td>
           <td data-stat="draws">7</td>
           <td data-stat="losses">26</td>
           <td data-stat="points">22</td>
         </tr>
       </tbody>
   </table>
</div>
'''
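# 'lxml' requires the lxml package; "html.parser" also works here
soup = BeautifulSoup(html, 'lxml')
# Select every row inside the table with class stats_table
for tr in soup.select('table.stats_table tr'):
    # One list of cell texts per row
    print(list(tr.stripped_strings))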

Output:

['38', '29', '6', '3', '93']
['38', '28', '8', '2', '92']
['38', '5', '7', '26', '22']
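
If you also need to know which column each value belongs to, the data-stat attribute can serve as a key. The following is only a sketch under assumptions: the HTML is a shortened stand-in for the real page, the table class is taken to be stats_table as in the question's selector, and the name rows is just illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page (assumption: table class is stats_table)
html = """
<table class="stats_table">
  <tbody>
    <tr data-row="0">
      <td data-stat="games">38</td>
      <td data-stat="wins">29</td>
      <td data-stat="points">93</td>
    </tr>
    <tr data-row="1">
      <td data-stat="games">38</td>
      <td data-stat="wins">28</td>
      <td data-stat="points">92</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# Only rows that carry a data-row attribute
for tr in soup.select("table.stats_table tr[data-row]"):
    # Map each cell's data-stat name to its stripped text
    row = {td["data-stat"]: td.get_text(strip=True) for td in tr.select("td[data-stat]")}
    rows.append(row)

for row in rows:
    print(row)
```

Each row then comes back as a dictionary such as {'games': '38', 'wins': '29', 'points': '93'}, which is easier to load into a CSV writer or a DataFrame than a bare list.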

CodePudding user response:

I see that you specifically want the elements inside the table that has the class stats_table. There are two ways to do this.

  • Using regex

import re
from bs4 import BeautifulSoup

html = '''
<div>
   <table class="stats_table">
      <tbody>
         <tr data-row="0">
            <td data-stat="games">38</td>
            <td data-stat="wins">29</td>
            <td data-stat="draws">6</td>
            <td data-stat="losses">3</td>
            <td data-stat="points">93</td>
         </tr>
         <tr data-row="1">
            <td data-stat="games">38</td>
            <td data-stat="wins">28</td>
            <td data-stat="draws">8</td>
            <td data-stat="losses">2</td>
            <td data-stat="points">92</td>
         </tr>
         <tr data-row="19">
            <td data-stat="games">38</td>
           <td data-stat="wins">5</td>
           <td data-stat="draws">7</td>
           <td data-stat="losses">26</td>
           <td data-stat="points">22</td>
         </tr>
       </tbody>
   </table>
</div>
'''

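soup = BeautifulSoup(html, "html.parser")
# The pattern ".*" matches any value, so this finds every td that has a data-stat attribute
datastats = soup.find_all("td", {"data-stat" : re.compile(r".*")})
for stat in datastats:
    # Prints one cell value per line
    print(stat.text)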

which prints the expected values, one per line: 38, 29, 6, 3, 93, 38, 28, 8, 2, 92, 38, 5, 7, 26, 22.

  • Using CSS Selector

The selector below selects all the td tags that have a data-stat attribute and sit inside the table with the class stats_table.

datastats  = soup.select("table.stats_table td[data-stat]")
for stat in datastats:
    print(stat.text)

which prints the same values, one per line: 38, 29, 6, 3, 93, 38, 28, 8, 2, 92, 38, 5, 7, 26, 22.

You can find more information on CSS_SELECTOR here
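
As for why the selector in the question returned an empty list: ":has(> [data-row])" only looks at direct children of the table, and in the posted markup the tr tags are children of tbody, not of the table itself. Here is a small sketch comparing the two forms; the HTML is a cut-down stand-in for the real page and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the real page (assumption: table class is stats_table)
html = """
<table class="stats_table">
  <tbody>
    <tr data-row="0"><td data-stat="wins">29</td></tr>
    <tr data-row="1"><td data-stat="wins">28</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Child combinator inside :has(): the table's only element child is tbody,
# which has no data-row attribute, so nothing matches
direct_child = soup.select(".stats_table:has(> [data-row])")

# Descendant combinator: matches the tr tags at any depth under the table
descendant = soup.select(".stats_table [data-row]")

wins = [tr.get_text(strip=True) for tr in descendant]

print(len(direct_child))
print(len(descendant))
print(wins)
```

Note also that even when a ":has(...)" selector does match, it returns the table element itself rather than the rows, so get_text would give one long string for the whole table.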
