I have this code to scrape all sections of several websites whose section URL contains the word "transparencia".
However, I do not know why, when the code prints the URLs that match the filter word, each one is repeated many times. I only need each of them once.
Input
from bs4 import BeautifulSoup
import lxml
import pandas as pd
from tqdm import tqdm_notebook
import requests

listUrl = ["http://agricultura.gencat.cat","http://cultura.gencat.cat",
           "https://dretssocials.gencat.cat","http://economia.gencat.cat",
           "https://educacio.gencat.cat","http://empresa.gencat.cat",
           "http://interior.gencat.cat","http://justicia.gencat.cat",
           "https://presidencia.gencat.cat","https://salutweb.gencat.cat",
           "https://politiquesdigitals.gencat.cat","https://territori.gencat.cat"]
herfList = []
codiNum = 0
keyWord = "transparencia"

def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")
    return parsed_response

def extract_post_data(url):
    soup_url = parse_url(url)
    try:
        herf_transparencia = soup_url.find_all('a', href=True)
    except:
        herf_transparencia = ""
    dadesUrlDic = {"herf transparencia": herf_transparencia}
    return dadesUrlDic

for url in listUrl:
    soup = parse_url(url)
    referenceHref = soup.find_all(class_="NG-megamenu__nav-link-self", href=True)
    for href in referenceHref:
        if href.text:
            herfList.append(href['href'])
        for i in herfList:
            if keyWord.lower() in i.lower():
                urlWithKeyWord = url + i
                print(urlWithKeyWord)
Output
For example, in this output the URLs containing the word "transparencia" are repeated, and the number of repetitions grows with every web section. On top of that, once there are no more web sections, the code keeps printing a few more lines with all the URLs.
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/funcio-publica/
http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
CodePudding user response:
How to fix?
Try to make the selection more specific, for example with CSS selectors,
and check whether the URL is already in your list of URLs before appending it:
for a in soup.select(f'a[href*={keyWord}]'):
    a = url + a["href"]
    if a not in hrefList:
        hrefList.append(a)
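To see why the membership check helps, here is a minimal offline sketch. It assumes the repetition in the question comes from filtering and printing inside the loop that is still appending to the list, which matches the growing pattern in the posted output. The names `menu_hrefs`, `collected`, `nested_output` and `unique_urls` are made up for this illustration, and the hrefs are a small hand-written sample rather than real scraped data.

```python
# Hypothetical hrefs standing in for the menu links of one site
# (one link appears twice, one does not contain the keyword).
menu_hrefs = [
    "/ca/departament/transparencia-i-bon-govern/",
    "/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/",
    "/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/",
    "/ca/ambits/",
]
key_word = "transparencia"
url = "http://agricultura.gencat.cat"

# Assumed pattern from the question: the filter loop sits inside the
# collecting loop, so the whole growing list is re-scanned after every append.
collected = []
nested_output = []
for href in menu_hrefs:
    collected.append(href)
    for item in collected:
        if key_word in item.lower():
            nested_output.append(url + item)

# Alternative: build each URL once and skip the ones already stored.
unique_urls = []
for href in menu_hrefs:
    full = url + href
    if key_word in href.lower() and full not in unique_urls:
        unique_urls.append(full)

print(len(nested_output))  # 9 entries produced from only 3 matching links
print(len(unique_urls))    # 2 distinct URLs
```

For long lists, a `set` kept alongside the list would make the "already seen" check faster, at the cost of one extra variable.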
Example
Note that this focuses on your question and does not contain all of your code.
from bs4 import BeautifulSoup
import requests

listUrl = ["http://agricultura.gencat.cat","http://cultura.gencat.cat",
           "https://dretssocials.gencat.cat","http://economia.gencat.cat",
           "https://educacio.gencat.cat","http://empresa.gencat.cat",
           "http://interior.gencat.cat","http://justicia.gencat.cat",
           "https://presidencia.gencat.cat","https://salutweb.gencat.cat",
           "https://politiquesdigitals.gencat.cat","https://territori.gencat.cat"]
hrefList = []
keyWord = "transparencia"

def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")
    return parsed_response

for url in listUrl:
    soup = parse_url(url)
    # Select only anchors whose href contains the keyword
    for a in soup.select(f'a[href*={keyWord}]'):
        a = url + a["href"]
        # Store each URL only once
        if a not in hrefList:
            hrefList.append(a)

hrefList
Output
['http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/',
'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/',
'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/cataleg-serveis/',
'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/normativa/',
'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/normativa-organitzacio/actuacions-administratives-juridiques/',
'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/',
'http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/gestio-serveis-publics/auditories-serveis-publics/',
'http://agricultura.gencat.cathttp://governobert.gencat.cat/ca/transparencia/Gestio-serveis-publics/Estudis-de-politiques-publiques-e-danalisi-comparada_/',
...]
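The last entry shown above has two hosts glued together, because that link's href was already an absolute URL and the site URL was prepended anyway. One possible way to handle both cases is `urljoin` from the standard library, which resolves relative hrefs against a base and leaves absolute ones unchanged. This is only a sketch: the variable names and the shortened absolute URL are placeholders chosen for illustration, not values taken from the sites.

```python
from urllib.parse import urljoin

base = "http://agricultura.gencat.cat"

# A relative href, as found in most of the menu links.
relative_href = "/ca/departament/transparencia-i-bon-govern/"

# A hypothetical absolute href pointing to another host.
absolute_href = "http://governobert.gencat.cat/ca/transparencia/"

joined_relative = urljoin(base, relative_href)
joined_absolute = urljoin(base, absolute_href)

print(joined_relative)  # http://agricultura.gencat.cat/ca/departament/transparencia-i-bon-govern/
print(joined_absolute)  # http://governobert.gencat.cat/ca/transparencia/
```

In the loop from the example, this would mean building the full URL with `urljoin(url, a["href"])` instead of string concatenation.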