I want to scrape several columns of text from td tags that share a common attribute. The td tags sit inside tr tags that also share a common attribute, inside a table with a specific class, inside a div.
For example, this is exactly how the website is structured:
<div>
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td>
<td data-stat="wins">29</td>
<td data-stat="draws">6</td>
<td data-stat="losses">3</td>
<td data-stat="points">93</td>
</tr>
<tr data-row="1">
<td data-stat="games">38</td>
<td data-stat="wins">28</td>
<td data-stat="draws">8</td>
<td data-stat="losses">2</td>
<td data-stat="points">92</td>
</tr>
.
.
.
<tr data-row="19">
<td data-stat="games">38</td>
<td data-stat="wins">5</td>
<td data-stat="draws">7</td>
<td data-stat="losses">26</td>
<td data-stat="points">22</td>
</tr>
</tbody>
</table>
</div>
I want to get the text enclosed in the td tags.
I have tried to solve this problem with the code below:
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
data = soup.select(".stats_table")
all_data = [l.get_text(strip=True) for l in soup.select(".stats_table:has(> [data-row])")]
print(all_data)
But when I try to execute this code, I get an empty list. I need your help on this matter, thanks.
CodePudding user response:
To get the text inside the td tags, you can iterate over each row's stripped_strings generator:
from bs4 import BeautifulSoup
html='''
<div>
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td>
<td data-stat="wins">29</td>
<td data-stat="draws">6</td>
<td data-stat="losses">3</td>
<td data-stat="points">93</td>
</tr>
<tr data-row="1">
<td data-stat="games">38</td>
<td data-stat="wins">28</td>
<td data-stat="draws">8</td>
<td data-stat="losses">2</td>
<td data-stat="points">92</td>
</tr>
<tr data-row="19">
<td data-stat="games">38</td>
<td data-stat="wins">5</td>
<td data-stat="draws">7</td>
<td data-stat="losses">26</td>
<td data-stat="points">22</td>
</tr>
</tbody>
</table>
</div>
'''
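# The 'lxml' parser requires the lxml package; 'html.parser' also works here
soup = BeautifulSoup(html, 'lxml')
# Each tr yields the text of its cells with surrounding whitespace stripped
for tr in soup.select('table.stats_table tr'):
    print(list(tr.stripped_strings))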
Output:
['38', '29', '6', '3', '93']
['38', '28', '8', '2', '92']
['38', '5', '7', '26', '22']
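If the column names matter as well as the values, the same loop can key each cell by its data-stat attribute. The following is only a sketch, assuming the table looks like the shortened sample here; the names rows and row are illustrative and not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened sample of the table from the question (assumed structure)
html = """
<table class="stats_table">
  <tbody>
    <tr data-row="0">
      <td data-stat="games">38</td>
      <td data-stat="wins">29</td>
      <td data-stat="points">93</td>
    </tr>
    <tr data-row="1">
      <td data-stat="games">38</td>
      <td data-stat="wins">28</td>
      <td data-stat="points">92</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# Visit only the tr elements that carry a data-row attribute
for tr in soup.select("table.stats_table tr[data-row]"):
    # Map each cell's data-stat name to its stripped text
    row = {td["data-stat"]: td.get_text(strip=True)
           for td in tr.select("td[data-stat]")}
    rows.append(row)

print(rows)
```

Each row becomes a dictionary, so a single statistic can be read as rows[0]["wins"]. The values are still strings and would need int() before doing arithmetic on them.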
CodePudding user response:
I see that you specifically want the cells inside the table that has the class stats_table. There are two ways to do this.
- Using regex
import re
from bs4 import BeautifulSoup
html = '''
<div>
<table class="stats_table">
<tbody>
<tr data-row="0">
<td data-stat="games">38</td>
<td data-stat="wins">29</td>
<td data-stat="draws">6</td>
<td data-stat="losses">3</td>
<td data-stat="points">93</td>
</tr>
<tr data-row="1">
<td data-stat="games">38</td>
<td data-stat="wins">28</td>
<td data-stat="draws">8</td>
<td data-stat="losses">2</td>
<td data-stat="points">92</td>
</tr>
<tr data-row="19">
<td data-stat="games">38</td>
<td data-stat="wins">5</td>
<td data-stat="draws">7</td>
<td data-stat="losses">26</td>
<td data-stat="points">22</td>
</tr>
</tbody>
</table>
</div>
'''
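soup = BeautifulSoup(html, "html.parser")
# The pattern .* matches any value, so this finds every td that has a data-stat attribute
datastats = soup.find_all("td", {"data-stat": re.compile(r".*")})
for stat in datastats:
    print(stat.text)
which prints the expected values, one per line (shown here on a single line):
38 29 6 3 93 38 28 8 2 92 38 5 7 26 22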
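As a side note, the regular expression is not strictly needed for an "attribute is present" test: Beautiful Soup treats True as "any value" in an attribute filter. The sketch below also limits the search to the stats_table table, which the find_all call above does not do. The second table and the cell without data-stat are made up purely to show the filtering.

```python
from bs4 import BeautifulSoup

# Made-up sample: one target table and one unrelated table
html = """
<table class="stats_table">
  <tbody>
    <tr data-row="0">
      <td data-stat="games">38</td>
      <td data-stat="wins">29</td>
      <td>no data-stat here</td>
    </tr>
  </tbody>
</table>
<table class="other_table">
  <tr><td data-stat="games">99</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the search to the table with class stats_table first
table = soup.find("table", class_="stats_table")

# True matches any td that has a data-stat attribute, whatever its value
cells = table.find_all("td", attrs={"data-stat": True})
values = [td.get_text(strip=True) for td in cells]

print(values)
```

The cell without a data-stat attribute and the cell in the other table are both skipped.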
- Using CSS Selector
The selector below selects all td tags that have a data-stat attribute inside the table that has the class stats_table.
datastats = soup.select("table.stats_table td[data-stat]")
for stat in datastats:
    print(stat.text)
which prints the same values, one per line (shown here on a single line):
38 29 6 3 93 38 28 8 2 92 38 5 7 26 22
You can find more information on CSS selectors in the Beautiful Soup documentation for the select() method.
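On why the selector in the question returned an empty list: assuming the downloaded HTML really has the structure shown in the question, .stats_table:has(> [data-row]) asks for the table itself, and only if one of its direct children has a data-row attribute. The tr elements are children of tbody, not of the table, so nothing matches. A rough sketch of the difference, using a shortened sample and illustrative variable names:

```python
from bs4 import BeautifulSoup

# Shortened sample with the same nesting as in the question
html = """
<table class="stats_table">
  <tbody>
    <tr data-row="0">
      <td data-stat="games">38</td>
      <td data-stat="wins">29</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Original selector: the table must have a DIRECT child with data-row.
# The rows are children of tbody, so this matches nothing.
direct_child = soup.select(".stats_table:has(> [data-row])")

# Without ">", any descendant counts, but the match is still the whole table
any_descendant = soup.select(".stats_table:has([data-row])")

# Selecting the cells themselves gives one element per td
cells = [td.get_text(strip=True)
         for td in soup.select(".stats_table > tbody > tr[data-row] > td")]

print(direct_child)
print(len(any_descendant))
print(cells)
```

An empty list can also mean that the table is not present in the HTML that requests receives at all, for example when the page builds it with JavaScript, so it is worth checking response.text for "stats_table" first.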