BeautifulSoup: scraping a specific href when its class appears at least twice

Time:12-24

I need to extract the website URL from several data-sheet pages like this one.

The problem is that the class “vermell_nobullet”, which is on the link whose href I need, appears at least twice on the page.

How can I extract only the “vermell_nobullet” link whose href is the website URL?

My code

from bs4 import BeautifulSoup
import lxml
import requests

def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")  # parse the response body with the lxml parser
    return parsed_response


depPres = "http://sac.gencat.cat/sacgencat/AppJava/organisme_fitxa.jsp?codi=6"

print(depPres)

soup = parse_url(depPres)

referClass = soup.find_all("a", {"class":"vermell_nobullet"})

referClass



Output I get:

[<a class="vermell_nobullet" href="https://ovt.gencat.cat/gsitfc/AppJava/generic/conqxsGeneric.do?webFormId=691">
                            Bústia electrònica
                        </a>,
 <a class="vermell_nobullet" href="http://presidencia.gencat.cat">http://presidencia.gencat.cat</a>]

Output I want:

http://presidencia.gencat.cat

CodePudding user response:

You can add a condition: if the text of an a tag is the same as its href, that is the tag you want.

referClass = soup.find_all("a", {"class": "vermell_nobullet"})

# Keep only the link whose visible text is its own URL
for refer in referClass:
    if refer.text == refer['href']:
        print(refer['href'])
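
The exact comparison above can miss a match if the link text carries surrounding whitespace, and refer['href'] raises KeyError for an a tag without an href. Below is a minimal sketch of a more tolerant comparison, assuming it is acceptable to ignore surrounding whitespace and a trailing slash. The helper name links_labelled_with_own_url and the (text, href) pair format are hypothetical, chosen only for illustration; the sample data is copied from the output shown in the question.

```python
def links_labelled_with_own_url(pairs):
    """Return the hrefs whose visible text is the URL itself.

    `pairs` is an iterable of (text, href) tuples; href may be None
    when an <a> tag has no href attribute. (Hypothetical helper, not
    part of BeautifulSoup.)
    """
    matches = []
    for text, href in pairs:
        if not href:
            continue  # skip anchors without an href
        # Assumption: surrounding whitespace and a trailing slash do not matter.
        if text.strip().rstrip("/") == href.strip().rstrip("/"):
            matches.append(href.strip())
    return matches


# Sample (text, href) pairs taken from the output shown in the question.
sample = [
    ("\n    Bústia electrònica\n  ",
     "https://ovt.gencat.cat/gsitfc/AppJava/generic/conqxsGeneric.do?webFormId=691"),
    ("http://presidencia.gencat.cat", "http://presidencia.gencat.cat"),
]

print(links_labelled_with_own_url(sample))
```

With BeautifulSoup, the pairs could be built from the find_all result, for example [(a.get_text(), a.get("href")) for a in referClass], since get("href") returns None instead of raising when the attribute is missing.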

Another way: take the last div with class blockAdresa, then take the last a tag inside it, using find_all for both:

soup.find_all("div",class_="blockAdresa")[-1].find_all("a")[-1]['href']

Output:

'http://presidencia.gencat.cat'
        