python/beautifulsoup to find <a href> in <p> of specific <div class>

Time:10-20

I am trying to use bs4 to parse HTML and find the href nested in the first p of the div with class 'entry-content', so that it returns 'http://withplum.com/':

<p>
<a href="http://withplum.com/" target="_blank" rel="noreferrer noopener">Plum</a>
</p>

The problem is that the following ad link sits right above the a href that I want, so my code returns the wrong link:

<a target="_blank" class="single-post-ad" href="https://deallite.uk/#pricing"> <img src="//www.uktechnews.info/wp-content/uploads/2019/07/Subscribe-for-weekly-deal-aler-£5.32_month.png" alt="Digiqole ad">
</a>

here is the html screenshot

Here is my code so far:

def extract_subpage(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
    r = requests.get(url, headers=headers)
    soup_subpage = BeautifulSoup(r.content, 'html.parser')
    
    return soup_subpage

def transform_subpage(soup_subpage):
    for item in soup_subpage.find_all('div', class_ = 'post-body'):
        subpage_link = item.find('a')['href']
        
        subpage_link = {
            'subpage_link': subpage_link
        }
        subpage.append(subpage_link)
    return

subpage = []

for url in tqdm(df['link']):
    t = extract_subpage(url)
    transform_subpage(t)

My question is: how do I modify my code to look only inside the p body, and within that find the first a href value?

-- edit:

I've edited my code to the following:

def transform_subpage(soup_subpage):
    for item in soup_subpage.select("div.entry-content.clearfix > p > a"):
        subpage_link = item[0]['href']
        
        subpage_link = {
            'subpage_link': subpage_link
        }
        subpage.append(subpage_link)
    return

subpage = []

But now I am getting KeyError: 0

Does anyone know how to fix this?

CodePudding user response:

I have used the HTML from your screenshot as test data. You can use a CSS selector to find the tag:

html="""<div>
<div class="entry-content clearfix">
<a target="_blank" class="single-post-ad" href="https://deallite.uk/#pricing">...</a>
<ul>...</ul>
<p>
<em>Friday 15th October 2021. London, UK. </em>
Fast-growing fintech
<a href="http://withplum.com/" target="_blank" rel="noreferrer noopener">Plum</a>
 is today announcing a first close of new funding that will supercharge the company's expansion, and cement Plum as Europe's ultimate money management app.&nbsp;
</p>
</div>
</div>"""

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"html.parser")

main_data=soup.select("div.entry-content.clearfix > p > a")
main_data[0]['href']

Output:

'http://withplum.com/'
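
A likely reason for the KeyError: 0 in the edited question code: select() returns a list of tags, so indexing with [0] works on the list, but inside a for loop each item is already a single tag, and item[0] is treated as a lookup of an attribute named 0. The sketch below illustrates this on a trimmed-down, hypothetical copy of the markup (SAMPLE_HTML and the placeholder text "ad" are assumptions, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup mirroring the structure in the question.
SAMPLE_HTML = """
<div class="entry-content clearfix">
  <a class="single-post-ad" href="https://deallite.uk/#pricing">ad</a>
  <p>Fast-growing fintech <a href="http://withplum.com/">Plum</a> is today announcing ...</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = soup.select("div.entry-content.clearfix > p > a")

# select() returns a list-like result, so integer indexing works on it.
first_href = links[0]["href"]

# Each element of that list is a Tag; tag[key] looks up an HTML attribute,
# so tag[0] asks for an attribute named 0 and raises KeyError.
try:
    links[0][0]
    error_name = None
except KeyError as exc:
    error_name = type(exc).__name__

print(first_href, error_name)
```

So inside the loop the fix would be item['href'] rather than item[0]['href'], or drop the loop and index the list once.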

Using the actual link:

import requests
res=requests.get("https://www.uktechnews.info/2021/10/13/humn-ai-secures-10-1-million-series-a-investment-led-by-bxr-group-and-shell-ventures/")
soup=BeautifulSoup(res.text,"html.parser")

Then select the element you need:

main_data=soup.select("div.entry-content.clearfix > p > a")
main_data[0]['href']

Output:

'http://humn.ai/'
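
To fit this back into the functions from the question, one possible shape is sketched below. It uses select_one, which returns the first match or None, so a page without a matching link is skipped instead of raising. The helper name first_content_link, the a[href] refinement of the selector, the extra subpage parameter, and the sample markup are all assumptions made for illustration:

```python
from bs4 import BeautifulSoup

# Same selector as above, narrowed to anchors that actually carry an href.
SELECTOR = "div.entry-content.clearfix > p > a[href]"


def first_content_link(soup):
    """Return the href of the first <a> directly inside a <p> of the content div, or None."""
    tag = soup.select_one(SELECTOR)
    return tag["href"] if tag is not None else None


def transform_subpage(soup_subpage, subpage):
    """Append {'subpage_link': href} to subpage when the page has a matching link."""
    href = first_content_link(soup_subpage)
    if href is not None:
        subpage.append({"subpage_link": href})


# Small offline demo on hypothetical markup (no network access needed).
demo_html = """
<div class="entry-content clearfix">
  <a class="single-post-ad" href="https://deallite.uk/#pricing">ad</a>
  <p><a href="http://withplum.com/">Plum</a></p>
</div>
"""
subpage = []
transform_subpage(BeautifulSoup(demo_html, "html.parser"), subpage)
transform_subpage(BeautifulSoup("<div><p>no link here</p></div>", "html.parser"), subpage)
print(subpage)
```

In the original loop this would be called as transform_subpage(extract_subpage(url), subpage) for each url. Passing the list in explicitly is a design choice that avoids relying on a module-level variable; keeping the global subpage list would work too.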