I need to extract the website URL section from several web data sheets like this one.
The problem is that the class "vermell_nobullet", which holds the href I need, appears at least twice on the page.
How can I extract only the "vermell_nobullet" element whose href is the website URL?
My code:
from bs4 import BeautifulSoup
import lxml
import requests

def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")  # Parse the response body with the lxml parser
    return parsed_response

depPres = "http://sac.gencat.cat/sacgencat/AppJava/organisme_fitxa.jsp?codi=6"
print(depPres)
soup = parse_url(depPres)
referClass = soup.find_all("a", {"class":"vermell_nobullet"})
referClass
Output I get:
[<a href="https://ovt.gencat.cat/gsitfc/AppJava/generic/conqxsGeneric.do?webFormId=691">
Bústia electrònica
</a>,
<a href="http://presidencia.gencat.cat">http://presidencia.gencat.cat</a>]
Output I want:
http://presidencia.gencat.cat
CodePudding user response:
You can add a condition: if the text of an a tag is the same as its href, take that particular tag.
referClass = soup.find_all("a", {"class":"vermell_nobullet"})
for refer in referClass:
    if refer.text==refer['href']:
        print(refer['href'])
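A minimal, offline sketch of the same text-equals-href check, in case it helps to try it without hitting the site. The SAMPLE_HTML string and the website_links helper are made up for illustration (the real page markup may differ), and get_text(strip=True) is used so that stray whitespace around the link text does not break the comparison:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; the actual markup may differ.
SAMPLE_HTML = """
<div class="blockAdresa">
  <a class="vermell_nobullet" href="https://ovt.gencat.cat/gsitfc/AppJava/generic/conqxsGeneric.do?webFormId=691">
    Bústia electrònica
  </a>
</div>
<div class="blockAdresa">
  <a class="vermell_nobullet" href="http://presidencia.gencat.cat">http://presidencia.gencat.cat</a>
</div>
"""

def website_links(soup):
    """Return hrefs of vermell_nobullet links whose visible text is the URL itself."""
    # href=True skips any matching <a> tag that has no href attribute.
    links = soup.find_all("a", class_="vermell_nobullet", href=True)
    return [a["href"] for a in links if a.get_text(strip=True) == a["href"]]

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
result = website_links(soup)
print(result)
```

Returning a list rather than printing inside the loop makes it easier to check whether zero, one, or several links matched on a given data sheet.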
Another way: find the last div element with class blockAdresa, then take the last href inside it using the find_all method.
soup.find_all("div",class_="blockAdresa")[-1].find_all("a")[-1]['href']
Output:
'http://presidencia.gencat.cat'
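The positional lookup raises IndexError when a data sheet has no blockAdresa div or the last one contains no links. A rough sketch of a guarded version follows; last_block_href and SAMPLE_HTML are hypothetical names used only for this example, and the sample markup is an assumption about the page structure:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; the real page may differ.
SAMPLE_HTML = """
<div class="blockAdresa">
  <a class="vermell_nobullet" href="https://ovt.gencat.cat/gsitfc/AppJava/generic/conqxsGeneric.do?webFormId=691">Bústia electrònica</a>
</div>
<div class="blockAdresa">
  <a class="vermell_nobullet" href="http://presidencia.gencat.cat">http://presidencia.gencat.cat</a>
</div>
"""

def last_block_href(soup):
    """Return the last href in the last blockAdresa div, or None if absent."""
    blocks = soup.find_all("div", class_="blockAdresa")
    if not blocks:
        return None
    anchors = blocks[-1].find_all("a", href=True)
    return anchors[-1]["href"] if anchors else None

href = last_block_href(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(href)
```

This only holds if the website link is always the last link in the last address block, which is worth verifying across the different data sheets before relying on it.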