I am using bs4 to extract data from HTML files with the following list comprehension:
columns = row.find_all('td')
print([(td.text,td.a.get('href')) for td in columns])
Not every <td> contains an <a> with an href, so the code fails: td.a is None for those cells, and calling .get on it raises an AttributeError.
How can I keep the list comprehension style while falling back to a default value of '' when the tag has no href attribute?
NOTE:
For example, the HTML of these columns looks like this:
<pre><code>
<td ><strong>12.649.000,18</strong></td> <td ><a href="/whatever/1.html" target="_blank">Brussels</a></td> <td ><a href="/whatever/2.html" target="_blank">Belgium</a></td> <td >blue</td>
</code></pre>
CodePudding user response:
If you want the second item in the tuple to be an empty string by default, you can do what HedgeHog suggested but in a slightly different way:
file.html
<pre><code>
<td >
<strong>12.649.000,18</strong>
</td>
<td >
<a href="/whatever/1.html" target="_blank">Brussels</a>
</td>
<td >
<a href="/whatever/2.html" target="_blank">Belgium</a>
</td>
<td >blue</td>
</code></pre>
parser.py
from bs4 import BeautifulSoup

with open("file.html", "r", encoding="utf-8") as html_doc:
    soup = BeautifulSoup(html_doc, "html.parser")

columns = soup.find_all("td")

def parse_cell(td) -> tuple[str, str]:
    # Fall back to "" when the cell has no <a> child
    return td.text.strip(), "" if td.a is None else td.a.get("href")

parsed = [parse_cell(td) for td in columns]
print(parsed)
Output
$ python parser.py
[('12.649.000,18', ''), ('Brussels', '/whatever/1.html'), ('Belgium', '/whatever/2.html'), ('blue', '')]
Also, I suggest moving the per-cell logic into its own function, as I did here, so that it is easier to see what the code is actually doing.
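One case parse_cell above does not cover is an <a> that exists but has no href: td.a.get("href") then returns None rather than ''. A minimal sketch of one way to handle that, assuming it is acceptable to pass a default as the second argument to Tag.get (the sample HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second cell has an <a> but no href attribute.
html = (
    '<td><a href="/whatever/1.html">Brussels</a></td>'
    '<td><a>Belgium</a></td>'
    '<td>blue</td>'
)
soup = BeautifulSoup(html, "html.parser")

def parse_cell(td) -> tuple[str, str]:
    # No <a> child at all: use the default
    if td.a is None:
        return td.text.strip(), ""
    # <a> present: Tag.get returns its second argument when href is missing
    return td.text.strip(), td.a.get("href", "")

parsed = [parse_cell(td) for td in soup.find_all("td")]
print(parsed)
```

With this variant the second element of every tuple is a string, which matches the tuple[str, str] annotation.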
CodePudding user response:
EDIT
Based on your comment, I now see what the expected output should look like. You have to put the conditional expression in the last position of your tuple:
[(e.text, '' if not e.a or not e.a.get('href') else e.a.get('href')) for e in soup.select('td')]
Note: You have to check both conditions, not e.a or not e.a.get('href'), because an <a> without an href attribute would otherwise yield None. For readability, the test should probably be moved into a function, as @MagicalCornFlake also mentioned.
Example
from bs4 import BeautifulSoup
html = '''
<td ><strong>12.649.000,18</strong></td> <td ><a href="/whatever/1.html" target="_blank">Brussels</a></td> <td ><a href="/whatever/2.html" target="_blank">Belgium</a></td> <td >blue</td>
<td ><strong>12.649.000,18</strong></td> <td ><a ref="/whatever/1.html" target="_blank">Brussels</a></td> <td ><a ref="/whatever/2.html" target="_blank">Belgium</a></td> <td >blue</td>
<td ><strong>12.649.000,18</strong></td> <td >Brussels</td> <td >Belgium</td> <td >blue</td>
'''
soup = BeautifulSoup(html, 'html.parser')
print([(e.text, '' if not e.a or not e.a.get('href') else e.a.get('href')) for e in soup.select('td')])
Output
[('12.649.000,18', ''), ('Brussels', '/whatever/1.html'), ('Belgium', '/whatever/2.html'), ('blue', ''), ('12.649.000,18', ''), ('Brussels', ''), ('Belgium', ''), ('blue', ''), ('12.649.000,18', ''), ('Brussels', ''), ('Belgium', ''), ('blue', '')]
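If calling e.a.get('href') twice feels clumsy, a possible alternative is to let find do the filtering: find("a", href=True) only matches an <a> that actually has an href, so a single None check covers both cases. This is only a sketch; it assumes Python 3.8+ for the assignment expression (:=), and the shortened sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one link with href, one <a> without href, one plain cell.
html = '''
<td><a href="/whatever/1.html">Brussels</a></td>
<td><a ref="/whatever/2.html">Belgium</a></td>
<td>blue</td>
'''
soup = BeautifulSoup(html, "html.parser")

# find("a", href=True) returns None both when there is no <a>
# and when the <a> has no href attribute.
result = [
    (td.text, a["href"] if (a := td.find("a", href=True)) is not None else "")
    for td in soup.find_all("td")
]
print(result)
```

The trade-off is that the walrus operator makes the comprehension denser, so a named helper function may still be the more readable choice.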