I am trying to use bs4 to parse HTML and find the href nested in the first p of the div with class 'entry-content', so that it returns 'http://withplum.com/':
<p>
<a href="http://withplum.com/" target="_blank" rel="noreferrer noopener">Plum</a>
</p>
The problem is that the following ad link sits right above the a href that I want, so my code returns the wrong link:
<a target="_blank" class="single-post-ad" href="https://deallite.uk/#pricing"> <img src="//www.uktechnews.info/wp-content/uploads/2019/07/Subscribe-for-weekly-deal-aler-£5.32_month.png" alt="Digiqole ad">
</a>
Here is my code so far:
def extract_subpage(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
    r = requests.get(url, headers)
    soup_subpage = BeautifulSoup(r.content, 'html.parser')
    return soup_subpage

def transform_subpage(soup_subpage):
    for item in soup_subpage.find_all('div', class_ = 'post-body'):
        subpage_link = item.find('a')['href']
        subpage_link = {
            'subpage_link': subpage_link
        }
        subpage.append(subpage_link)
    return

subpage = []
for url in tqdm(df['link']):
    t = extract_subpage(url)
    transform_subpage(t)
My question is: how do I modify my code to look only inside the p elements, and within those find the first a href value?
-- edit:
I've edited my code to the following:
def transform_subpage(soup_subpage):
    for item in soup_subpage.select("div.entry-content.clearfix > p > a"):
        subpage_link = item[0]['href']
        subpage_link = {
            'subpage_link': subpage_link
        }
        subpage.append(subpage_link)
    return

subpage = []
But now I am getting KeyError: 0
Does anyone know how to fix this?
CodePudding user response:
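I have taken your data as an HTML string.
You can use a CSS selector to find the tag. The child combinator (>) matches only a elements that sit directly inside a p, which skips the ad link placed directly under the div: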
html = """<div>
<div class="entry-content clearfix">
<a target="_blank" href="https://deallite.uk/#pricing">...</a>
<ul>...</ul>
<p>
<em>Friday 15th October 2021. London, UK. </em>
Fast-growing fintech
<a href="http://withplum.com/" target="_blank" rel="noreferrer noopener">Plum</a>
is today announcing a first close of new funding that will
supercharge the company's expansion, and cement Plum as
Europe's ultimate money management app.
</p>
</div>
</div>"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"html.parser")
main_data=soup.select("div.entry-content.clearfix > p > a")
main_data[0]['href']
Output:
'http://withplum.com/'
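As for the KeyError: 0 in the edited question: select() returns a list of Tag objects, and the for loop hands you one Tag at a time. Indexing that Tag with [0] asks for an attribute named 0, which does not exist. Here is a minimal sketch of the difference, assuming markup shaped like the snippet above (sample_html and the variable names are illustrative, not taken from the live page):

```python
from bs4 import BeautifulSoup

# Illustrative markup modelled on the snippet above (an assumption, not the live page).
sample_html = """
<div class="entry-content clearfix">
  <a class="single-post-ad" href="https://deallite.uk/#pricing">ad</a>
  <p>Fast-growing fintech <a href="http://withplum.com/">Plum</a> is announcing funding.</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select() returns a list of Tag objects, so the list itself can be indexed by position.
links = soup.select("div.entry-content.clearfix > p > a")
first_href = links[0]["href"]

# Indexing a single Tag looks up an attribute by name; there is no attribute
# called 0, so this raises KeyError, just like item[0] inside the loop.
try:
    links[0][0]
    error_name = None
except KeyError as exc:
    error_name = type(exc).__name__

print(first_href)
print(error_name)
```

So inside the loop you would write item['href'] rather than item[0]['href'], or skip the loop and index the list once as shown.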
The same selector applied to one of the live pages:
import requests
res=requests.get("https://www.uktechnews.info/2021/10/13/humn-ai-secures-10-1-million-series-a-investment-led-by-bxr-group-and-shell-ventures/")
soup=BeautifulSoup(res.text,"html.parser")
# Select the element you need
main_data=soup.select("div.entry-content.clearfix > p > a")
main_data[0]['href']
Output:
'http://humn.ai/'
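To fit this back into the transform_subpage function from the question, one possible sketch is below. It assumes every page uses the same div.entry-content.clearfix layout, which I have only checked for the page above. The helper name first_paragraph_link, the extra subpage parameter, and demo_html are my own additions for illustration. It uses select_one(), which returns the first match or None, so pages without a matching link are skipped instead of raising an error.

```python
from bs4 import BeautifulSoup

def first_paragraph_link(soup):
    """Return the href of the first <a> sitting directly inside a <p> of
    div.entry-content.clearfix, or None when nothing matches."""
    link = soup.select_one("div.entry-content.clearfix > p > a[href]")
    return link["href"] if link is not None else None

def transform_subpage(soup_subpage, subpage):
    # Append a dict in the same shape as the question's code, skipping pages
    # where no matching link was found.
    href = first_paragraph_link(soup_subpage)
    if href is not None:
        subpage.append({"subpage_link": href})

# Illustrative markup (an assumption, not fetched from the site).
demo_html = """
<div class="entry-content clearfix">
  <a class="single-post-ad" href="https://deallite.uk/#pricing">ad</a>
  <p><em>Friday 15th October 2021.</em> <a href="http://withplum.com/">Plum</a></p>
</div>
"""

subpage = []
transform_subpage(BeautifulSoup(demo_html, "html.parser"), subpage)
transform_subpage(BeautifulSoup("<div><p>no link here</p></div>", "html.parser"), subpage)
print(subpage)
```

In the question's loop you would call transform_subpage(t, subpage) for each page. Separately, requests.get(url, headers) passes the dict as the params argument; to send it as request headers it needs the keyword form requests.get(url, headers=headers).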