Web scraping extracts the same thing n times

Time:12-15

I have this code to scrape the sections of several websites and find the ones whose URL contains the word "transparencia".

However, I do not know why, when the code prints the URLs that match the keyword, each one is repeated n times. I only need each of them once.

Input

from bs4 import BeautifulSoup
import lxml
import pandas as pd
from tqdm import tqdm_notebook
import requests

listUrl = ["http://agricultura.gencat.cat","http://cultura.gencat.cat",
           "https://dretssocials.gencat.cat","http://economia.gencat.cat",
          "https://educacio.gencat.cat","http://empresa.gencat.cat",
          "http://interior.gencat.cat","http://justicia.gencat.cat",
          "https://presidencia.gencat.cat","https://salutweb.gencat.cat",
           "https://politiquesdigitals.gencat.cat","https://territori.gencat.cat"]


herfList=[]
codiNum = 0
keyWord = "transparencia"

def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")
    return parsed_response

def extract_post_data (url):
    
    soup_url = parse_url(url)
    
    try:
        herf_transparencia = soup_url.find_all('a', href =True)
    except:
        herf_transparencia = ""
    
    dadesUrlDic= {"herf transparencia": herf_transparencia}
    
    return dadesUrlDic



for url in listUrl:
    soup = parse_url(url)
    referenceHref = soup.find_all(class_= "NG-megamenu__nav-link-self", href= True)
    
    for href in referenceHref:
        
            if href.text:
                herfList.append(href['href'])
                
                for i in herfList:
                    
                    if keyWord.lower() in i.lower(): 
                        urlWithKeyWord = url + i
                        print(urlWithKeyWord)


Output

For example, in this output the URLs containing the word "transparencia" are repeated, and the number of repetitions grows with every web section. Even when there are no more web sections, the code keeps printing a few more lines with all the URLs.

http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/funcio-publica/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/


CodePudding user response:

How to fix?

Select more specifically, for example with CSS selectors, and check whether the URL is already in your list of URLs before appending it:

# match only <a> tags whose href contains the keyword
for a in soup.select(f'a[href*="{keyWord}"]'):
    a = url + a["href"]
    if a not in hrefList:
        hrefList.append(a)
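
The repetition in the question seems to come from the inner `for i in herfList` loop, which rescans the whole growing list every time a link is appended, so each match is printed again for every later link. The sketch below reproduces that pattern next to the "check before appending" approach. The `links` list is hypothetical sample data standing in for the scraped hrefs, not real output from the sites.

```python
# Hypothetical menu links standing in for the scraped hrefs (assumed sample data).
links = [
    "/ca/inici/",
    "/ca/departament/transparencia-i-bon-govern/",
    "/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/",
    "/ca/contacte/",
]
key_word = "transparencia"

# Pattern from the question: rescan the whole growing list after every append.
collected = []
printed_nested = []
for href in links:
    collected.append(href)
    for item in collected:
        if key_word in item.lower():
            printed_nested.append(item)

# Alternative: test each link once and keep it only if it is not stored yet.
unique_matches = []
for href in links:
    if key_word in href.lower() and href not in unique_matches:
        unique_matches.append(href)

print(len(printed_nested))  # more entries than there are matching links
print(unique_matches)       # each matching link once, in page order
```

With many links, a `set` kept alongside the list would make the membership check cheaper, at the cost of one extra variable.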

Example

Note that this focuses on your question and does not contain all of your code.

from bs4 import BeautifulSoup
import requests

listUrl = ["http://agricultura.gencat.cat","http://cultura.gencat.cat",
           "https://dretssocials.gencat.cat","http://economia.gencat.cat",
          "https://educacio.gencat.cat","http://empresa.gencat.cat",
          "http://interior.gencat.cat","http://justicia.gencat.cat",
          "https://presidencia.gencat.cat","https://salutweb.gencat.cat",
           "https://politiquesdigitals.gencat.cat","https://territori.gencat.cat"]


hrefList=[]
keyWord = "transparencia"

def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")
    return parsed_response

for url in listUrl:
    soup = parse_url(url)

    # match only <a> tags whose href contains the keyword
    for a in soup.select(f'a[href*="{keyWord}"]'):
        a = url + a["href"]
        if a not in hrefList:
            hrefList.append(a)

hrefList

Output

['http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/',
 'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/',
 'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/cataleg-serveis/',
 'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/normativa/',
 'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/actuacions-administratives-juridiques/',
 'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/',
 'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/auditories-serveis-publics/',
 'http://agricultura.gencat.cathttp://governobert.gencat.cat/ca/transparencia/Gestio-serveis-publics/Estudis-de-politiques-publiques-e-danalisi-comparada_/',
...]
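
The last entry shown above is a base URL glued in front of a link that was already absolute, which is what plain string concatenation does. One possible way around this is `urljoin` from the standard library's `urllib.parse`, which resolves relative links against the base and leaves absolute ones untouched. This is a minimal sketch: the `hrefs` list is assumed sample data copied from the output above rather than fetched from the site.

```python
from urllib.parse import urljoin

base = "http://agricultura.gencat.cat"

# Sample hrefs taken from the output above (assumed, not fetched live).
hrefs = [
    "/ca/departament/transparencia-i-bon-govern/",
    "http://governobert.gencat.cat/ca/transparencia/Gestio-serveis-publics/"
    "Estudis-de-politiques-publiques-e-danalisi-comparada_/",
    "/ca/departament/transparencia-i-bon-govern/",  # duplicate on purpose
]

resolved = []
for href in hrefs:
    # Relative links are resolved against base; absolute links are kept as-is.
    full_url = urljoin(base, href)
    if full_url not in resolved:
        resolved.append(full_url)

for full_url in resolved:
    print(full_url)
```

In the example loop, this would mean replacing `url + a["href"]` with `urljoin(url, a["href"])`.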