I have an HTML file that looks roughly like this:
<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list" >[contents of the table]</table>
[OTHER CODE]
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list" >[contents of another table]</table>
[repeats a few times over]
My end goal is to extract the tables and attach the corresponding date to each one.
Using this code, I managed to extract only the tables:
tables = soup.find_all("table", {"class": "mon_list"})
My question is: how can I extract both the dates and the tables, so that each table is paired with its corresponding date?
CodePudding user response:
find_all supports a custom filter function; see the docs.
Here is an example of usage:
import bs4

html = """<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list">[contents of the table]</table>
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list">[contents of another table]</table><span>hhhh</span>"""

soup = bs4.BeautifulSoup(html, 'lxml')

def finder(tag1, tag2):
    # Build a filter for find_all that matches either tag name
    def _wrapper(tag):
        return tag.name == tag1 or tag.name == tag2
    return _wrapper

# Matches come back in document order, so each date precedes its table
tags = soup.find_all(finder('table', 'div'))
print([tag.text if tag.name == 'div' else tag for tag in tags])
Output
['[CURRENT DATE]', <table class="mon_list">[contents of the table]</table>, '[ANOTHER DATE]', <table class="mon_list">[contents of another table]</table>]
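To actually attach each date to its table, the flat list above can be walked in document order while remembering the most recent date. The following is only a minimal sketch, assuming every table comes after its mon_title div; the helper name pair_dates_with_tables is hypothetical, and html.parser is used here merely to avoid the lxml dependency.

```python
import bs4

html = """<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list">[contents of the table]</table>
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list">[contents of another table]</table><span>hhhh</span>"""


def pair_dates_with_tables(soup):
    """Return (date_text, table_tag) pairs in document order (hypothetical helper)."""
    pairs = []
    current_date = None  # text of the most recent mon_title div seen so far
    # find_all with a function returns the matching tags in document order
    for tag in soup.find_all(lambda t: t.name in ("div", "table")):
        classes = tag.get("class") or []
        if tag.name == "div" and "mon_title" in classes:
            current_date = tag.get_text(strip=True)
        elif tag.name == "table" and "mon_list" in classes:
            pairs.append((current_date, tag))
    return pairs


soup = bs4.BeautifulSoup(html, "html.parser")
pairs = pair_dates_with_tables(soup)
for date, table in pairs:
    print(date, "->", table.get_text(strip=True))
```

A table that appears before any date div would be paired with None, so the caller can decide how to treat that case.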
CodePudding user response:
You can do it like this.
Select the <table> elements with the class name mon_list using find_all().
For each table selected above, since the date <div> comes before the <table> element, you can select it using the .findPreviousSibling() method: .findPreviousSibling('div', class_='mon_title')
Here is the complete code, which prints the date first and then the table data.
from bs4 import BeautifulSoup

s = """
<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list">[contents of the table]</table>
[OTHER CODE]
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list">[contents of another table]</table>"""

soup = BeautifulSoup(s, 'lxml')

# Select every table with the class mon_list
tabs = soup.find_all('table', class_='mon_list')
for tab in tabs:
    # The date <div> is the nearest preceding sibling with the class mon_title
    date_div = tab.findPreviousSibling('div', class_='mon_title')
    print(f"Date: {date_div.text.strip()}\nTable Data: {tab.text.strip()}\n")
Date: [CURRENT DATE]
Table Data: [contents of the table]
Date: [ANOTHER DATE]
Table Data: [contents of another table]
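If the pairs need to be kept rather than printed, the same sibling lookup can fill a dictionary. This is only a sketch under a few assumptions: the dates are unique (otherwise later tables overwrite earlier ones), the name tables_by_date is made up for illustration, and html.parser stands in for lxml. It uses find_previous_sibling, the snake_case spelling of findPreviousSibling in bs4, and guards against a table that has no preceding date div.

```python
from bs4 import BeautifulSoup

s = """
<div class="mon_title">[CURRENT DATE]</div>
<table class="mon_list">[contents of the table]</table>
[OTHER CODE]
<div class="mon_title">[ANOTHER DATE]</div>
<table class="mon_list">[contents of another table]</table>"""

soup = BeautifulSoup(s, "html.parser")

tables_by_date = {}  # hypothetical name; assumes each date appears only once
for tab in soup.find_all("table", class_="mon_list"):
    # Nearest preceding sibling <div class="mon_title">, or None if absent
    date_div = tab.find_previous_sibling("div", class_="mon_title")
    date = date_div.get_text(strip=True) if date_div is not None else None
    tables_by_date[date] = tab

for date, tab in tables_by_date.items():
    print(date, "->", tab.get_text(strip=True))
```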