Select two HTML elements at the same time in BeautifulSoup?

Time:10-28

I have an HTML file that looks roughly like this:

<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list" >[contents of the table]</table>
[OTHER CODE]
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list" >[contents of another table]</table>
[repeats a few times over]

My end-goal is to extract the tables and somehow add the corresponding date to each.

Using this code I successfully extracted only the tables:

tables = soup.find_all("table", {"class": "mon_list"})

My question is: how can I extract both the dates and the tables, so that each table ends up paired with its corresponding date?

CodePudding user response:

find_all supports a custom function as a filter (see the docs).

Here is an example of usage:

html = """<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list">[contents of the table]</table>
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list">[contents of another table]</table><span>hhhh</span>"""

import bs4

soup = bs4.BeautifulSoup(html, 'lxml')

def finder(tag1, tag2):
    # Build a filter for find_all: it matches any tag named tag1 or tag2.
    def _wrapper(tag):
        return tag.name in (tag1, tag2)
    return _wrapper

# Matching tags come back in document order.
tags = soup.find_all(finder('table', 'div'))

# Show the text of each date div and the full tag of each table.
print([tag.text if tag.name == 'div' else tag for tag in tags])

Output

['[CURRENT DATE]', <table class="mon_list">[contents of the table]</table>, '[ANOTHER DATE]', <table class="mon_list">[contents of another table]</table>]
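
The flat list above still leaves the pairing to you. One possible way to attach each date to the table that follows it is sketched below. It assumes every table is preceded somewhere earlier in the document by its own mon_title div, uses Python's built-in "html.parser" instead of lxml, and passes a list of tag names to find_all rather than a custom function. The helper name pair_dates_with_tables is made up for this illustration.

```python
from bs4 import BeautifulSoup

html = """<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list">[contents of the table]</table>
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list">[contents of another table]</table><span>hhhh</span>"""

# Assumption: "html.parser" (bundled with Python) is good enough for this markup.
soup = BeautifulSoup(html, "html.parser")


def pair_dates_with_tables(soup):
    """Hypothetical helper: return [(date_text, table_tag), ...] in document order."""
    pairs = []
    current_date = None
    # A list of names matches either tag; results come back in document order.
    for tag in soup.find_all(["div", "table"]):
        if tag.name == "div":
            # Only divs carrying the mon_title class are treated as dates.
            if "mon_title" in tag.get("class", []):
                current_date = tag.get_text(strip=True)
        else:
            # Attach the most recently seen date (None if there was none).
            pairs.append((current_date, tag))
    return pairs


pairs = pair_dates_with_tables(soup)
for date, table in pairs:
    print(date, "->", table.get_text(strip=True))
```

Because the loop only remembers the last date it saw, a table that has no date of its own would silently inherit the previous one, so check for that if your real page is less regular than the sample.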

CodePudding user response:

You can do it like this.

  • Select every <table> with the class mon_list using find_all().

  • For each table selected above, the date <div> appears before the <table> as a sibling, so you can select it with the .find_previous_sibling() method (findPreviousSibling is the older alias for the same method).

    .find_previous_sibling('div', class_='mon_title')

Here is the complete code that will print the date first and then the table data.

from bs4 import BeautifulSoup
s = """
<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list">[contents of the table]</table>
[OTHER CODE]
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list">[contents of another table]</table>"""

soup = BeautifulSoup(s, 'lxml')
tabs = soup.find_all('table', class_='mon_list')
for tab in tabs:
    # The date <div> is the nearest preceding sibling of each table.
    date_div = tab.find_previous_sibling('div', class_='mon_title')
    print(f"Date: {date_div.text.strip()}\nTable Data: {tab.text.strip()}\n")

Output

Date: [CURRENT DATE]
Table Data: [contents of the table]

Date: [ANOTHER DATE]
Table Data: [contents of another table]
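
If you want to keep the date together with each table instead of just printing it, a minimal sketch along the same lines could collect them into a list of dicts. The records name and the dict keys are only illustrative, and the parser is switched to the built-in "html.parser" so that lxml is not required. A table without a preceding mon_title sibling gets None as its date rather than raising an error.

```python
from bs4 import BeautifulSoup

s = """
<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list">[contents of the table]</table>
[OTHER CODE]
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list">[contents of another table]</table>"""

# Assumption: "html.parser" (bundled with Python) handles this markup fine.
soup = BeautifulSoup(s, "html.parser")

records = []
for tab in soup.find_all("table", class_="mon_list"):
    # Nearest earlier sibling that is a <div class="mon_title">, or None.
    date_div = tab.find_previous_sibling("div", class_="mon_title")
    records.append({
        "date": date_div.get_text(strip=True) if date_div is not None else None,
        "table": tab,  # keep the Tag itself so rows and cells can be parsed later
    })

for rec in records:
    print(rec["date"], "->", rec["table"].get_text(strip=True))
```

Note that find_previous_sibling only looks at elements sharing the table's parent. If the date div sits in a different container on the real page, find_previous (which searches everything earlier in the document) may be the better fit.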