How to set a default value for a non-existing <a> or href in a list comprehension?

I am using bs4 to extract data from HTML files with the following list comprehension:

columns = row.find_all('td')
print([(td.text,td.a.get('href')) for td in columns])

Not every <td> contains an <a> with an href, so the code fails with an AttributeError when td.a is None.

How can I keep the list comprehension style while adding a default value '' if the tag does not have an href attribute?

NOTE:

As an example, the HTML of those columns looks like this:

<pre><code>
<td  ><strong>12.649.000,18</strong></td> <td  ><a href="/whatever/1.html" target="_blank">Brussels</a></td> <td  ><a href="/whatever/2.html" target="_blank">Belgium</a></td> <td  >blue</td>
</code></pre>

CodePudding user response：

If you want the second item in the tuple to be an empty string by default, you can do what HedgeHog suggested but in a slightly different way:

file.html

<pre><code>
<td  >
  <strong>12.649.000,18</strong>
</td> 
<td  >
  <a href="/whatever/1.html" target="_blank">Brussels</a>
</td> 
<td  >
  <a href="/whatever/2.html" target="_blank">Belgium</a>
</td> 
<td  >blue</td>
</code></pre>

parser.py

from bs4 import BeautifulSoup

with open("file.html", "r", encoding="utf-8") as html_doc:
    soup = BeautifulSoup(html_doc, "html.parser")

columns = soup.find_all("td")


def parse_cell(td) -> tuple[str, str]:
    # Fall back to "" both when there is no <a> and when the <a> has no href
    return td.text.strip(), ("" if td.a is None else td.a.get("href", ""))


parsed = [parse_cell(td) for td in columns]
print(parsed)

Output

$ python parser.py
[('12.649.000,18', ''), ('Brussels', '/whatever/1.html'), ('Belgium', '/whatever/2.html'), ('blue', '')]

Also, I suggest separating the code out like I did, so that it is easier to see what the code is actually doing.
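If a single expression is still preferred, here is a minimal sketch of an inline variant. It assumes Python 3.8 or later for the assignment expression (:=), and the inline HTML string is only illustrative. It is an alternative to the helper function above, not a replacement for it.

```python
from bs4 import BeautifulSoup

# Illustrative input: one cell without <a>, one with href, one <a> without href
html = """
<td><strong>12.649.000,18</strong></td>
<td><a href="/whatever/1.html" target="_blank">Brussels</a></td>
<td><a target="_blank">Belgium</a></td>
<td>blue</td>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = [
    # Bind td.a once; use "" when there is no <a> or the <a> has no href
    (td.text.strip(), (a.get("href", "") if (a := td.a) is not None else ""))
    for td in soup.find_all("td")
]
print(pairs)
# [('12.649.000,18', ''), ('Brussels', '/whatever/1.html'), ('Belgium', ''), ('blue', '')]
```

The trade-off is density: the comprehension stays a one-liner, but the named function is easier to read and to test.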

CodePudding user response：

EDIT

Based on your comment, I now see what the expected output should look like. You have to put the conditional expression in the last position of your tuple:

[(e.text, '' if not e.a or not e.a.get('href') else e.a.get('href')) for e in soup.select('td')]

Note: You have to check both not e.a and not e.a.get('href'). If there is an <a> without an href attribute, e.a.get('href') returns None, which would otherwise end up in the result. For readability, the test should probably be moved into a function, as also mentioned by @MagicalCornFlake.

Example

from bs4 import BeautifulSoup
html ='''
<td  ><strong>12.649.000,18</strong></td> <td  ><a href="/whatever/1.html" target="_blank">Brussels</a></td> <td  ><a href="/whatever/2.html" target="_blank">Belgium</a></td> <td  >blue</td>
<td  ><strong>12.649.000,18</strong></td> <td  ><a ref="/whatever/1.html" target="_blank">Brussels</a></td> <td  ><a ref="/whatever/2.html" target="_blank">Belgium</a></td> <td  >blue</td>
<td  ><strong>12.649.000,18</strong></td> <td  >Brussels</td> <td  >Belgium</td> <td  >blue</td>
'''
soup = BeautifulSoup(html, 'html.parser')

print([(e.text, '' if not e.a or not e.a.get('href') else e.a.get('href')) for e in soup.select('td')])

Output

[('12.649.000,18', ''), ('Brussels', '/whatever/1.html'), ('Belgium', '/whatever/2.html'), ('blue', ''), ('12.649.000,18', ''), ('Brussels', ''), ('Belgium', ''), ('blue', ''), ('12.649.000,18', ''), ('Brussels', ''), ('Belgium', ''), ('blue', '')]
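The suggestion to move the test into a function could be sketched as follows. The name href_or_default and its default parameter are hypothetical and chosen only for illustration, and the shortened HTML string stands in for the example above.

```python
from bs4 import BeautifulSoup


def href_or_default(cell, default=""):
    """Return the href of the first <a> in cell, or default if it is missing."""
    link = cell.a
    if link is None:
        # No <a> inside this cell at all
        return default
    # link.get("href") is None when the attribute is absent; "or" also
    # replaces an empty href with the default
    return link.get("href") or default


html = (
    '<td><a href="/whatever/1.html">Brussels</a></td>'
    '<td><a ref="/whatever/2.html">Belgium</a></td>'
    '<td>blue</td>'
)
soup = BeautifulSoup(html, "html.parser")

result = [(e.text, href_or_default(e)) for e in soup.select("td")]
print(result)
# [('Brussels', '/whatever/1.html'), ('Belgium', ''), ('blue', '')]
```

The comprehension itself stays short, and the default can be changed per call, for example href_or_default(e, default="#").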