Scrape date and link from an HTML table where both items are separated by different tags

I have a long table with links and dates (see the HTML code below). The data in the table is well structured: all the information appears in blocks, with a date (the isodate attribute) on a dt tag and the link inside the dd tag that follows it.

Using Python and BeautifulSoup, I want to get the date and the link for each block.

<dt isodate="2022-02-10">
<div >10 February 2021</div>
</dt>
<dd>
<div ><span ><a  href="https://www.bla.html" lang="en"> <span >English</span></a></span></div>
</dd>

<dt isodate="2022-02-12">
<div >12 February 2021</div>
</dt>
<dd>
<div ><span ><a  href="https://www.bli.html" lang="en"> <span >English</span></a></span></div>
</dd>

How can I achieve that?

CodePudding user response:

You can use Beautiful Soup's find_all and find_next_sibling methods to select each dt and its sibling dd, as below:

from bs4 import BeautifulSoup

html_doc = """
    <dt isodate="2022-02-10">
    <div >10 February 2021</div>
    </dt>
    <dd>
    <div ><span ><a  href="https://www.bla.html" lang="en"> <span >English</span></a></span></div>
    </dd>

    <dt isodate="2022-02-12">
    <div >12 February 2021</div>
    </dt>
    <dd>
    <div ><span ><a  href="https://www.bli.html" lang="en"> <span >English</span></a></span></div>
    </dd>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

items = []

for item in soup.find_all('dt'):
    date = item.get('isodate')
    url = item.find_next_sibling('dd').select_one('div a').get('href')
    # add a dict to the items list
    items.append(dict(date=date, url=url))

print(items)

Output:

[
{'date': '2022-02-10', 'url': 'https://www.bla.html'}, 
{'date': '2022-02-12', 'url': 'https://www.bli.html'}
]
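The loop above assumes every dt is followed by a dd that contains a link; if one is missing, the chained calls raise an AttributeError. A slightly more defensive sketch might look like the following. The helper name extract_pairs and the decision to skip incomplete blocks are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup


def extract_pairs(html):
    """Return (isodate, href) tuples, skipping blocks without a link."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        # find_next_sibling returns None when no later <dd> sibling exists
        link = dd.find("a", href=True) if dd is not None else None
        if link is None:
            continue  # assumption: incomplete blocks are simply skipped
        pairs.append((dt.get("isodate"), link["href"]))
    return pairs


# Hypothetical sample: the second block has no link
sample = """
<dt isodate="2022-02-10"><div>10 February 2021</div></dt>
<dd><div><a href="https://www.bla.html" lang="en">English</a></div></dd>
<dt isodate="2022-02-12"><div>12 February 2021</div></dt>
<dd><div>no link here</div></dd>
"""

print(extract_pairs(sample))
```

Note that find_next_sibling matches any later sibling, so a dt that lacks its own dd entirely would pick up the dd of the following block; this sketch does not guard against that case.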

CodePudding user response:

Here, index.html contains:

<html>
<body>

    <dt isodate="2022-02-10"><div class="date">10 February 2021</div></dt>
    <dd>
        <div ><span ><a  href="https://www.bla.html" lang="en"> <span >English</span></a></span></div>
    </dd>

    <dt isodate="2022-02-12"><div class="date">12 February 2021</div></dt>
    <dd>
        <div ><span ><a  href="https://www.bli.html" lang="en"> <span >English</span></a></span></div>
    </dd>
</body>
</html>

Python code used:

from bs4 import BeautifulSoup

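with open("index.html") as html_content:
    soup = BeautifulSoup(html_content, 'html.parser')
# text of every div with class "date"
date = [x.get_text() for x in soup.find_all('div', class_='date')]
# href value of every a tag that has an href attribute
link = [x['href'] for x in soup.find_all('a', href=True)]
print(date, link)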

For Date:

find_all finds every div with class 'date' and returns a list of all the tags that match.

[<div class="date">10 February 2021</div>, <div class="date">12 February 2021</div>]

From here you can extract the date via x.get_text()

For Link:

find_all finds every a tag that has an href attribute (href=True) and returns a list of all the tags that match.

[<a  href="https://www.bla.html" lang="en"> <span >English</span></a>, <a  href="https://www.bli.html" lang="en"> <span >English</span></a>]

From here you can extract the href data using x['href']
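To illustrate what href=True changes, here is a small sketch with a made-up snippet (the anchor with a name attribute is hypothetical and does not come from the page in question). Without the filter, x['href'] would raise a KeyError on an a tag that has no href.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <a> without href, one with
snippet = '<a name="top">anchor</a><a href="https://www.bla.html">English</a>'
soup = BeautifulSoup(snippet, "html.parser")

all_anchors = soup.find_all("a")            # matches both tags
with_href = soup.find_all("a", href=True)   # only tags that carry an href

hrefs = [a["href"] for a in with_href]
print(len(all_anchors), hrefs)
```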

Output: the dates and the URLs are returned as two separate lists.

['10 February 2021', '12 February 2021'] ['https://www.bla.html', 'https://www.bli.html']
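Since the question asks for the date and the link of each block together, the two lists can be paired by position. This is only a sketch: it assumes every block contributes exactly one date and one link and that the page has no other a tags, otherwise the lists drift out of step. The variable names dates, links and pairs are chosen here for illustration.

```python
# Values copied from the output above
dates = ['10 February 2021', '12 February 2021']
links = ['https://www.bla.html', 'https://www.bli.html']

# Pairing by position is only meaningful if both lists have the same length
if len(dates) != len(links):
    raise ValueError("dates and links differ in length; cannot pair by position")

pairs = list(zip(dates, links))
for date_text, url in pairs:
    print(date_text, url)
```

If the page may contain incomplete blocks, walking from each dt to its sibling dd, as in the first answer, is the safer way to keep each date attached to its own link.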