I'm trying to find all websites on Google that end with "gencat.cat".
My code:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'gencat.cat'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all the needed data
for result in soup.select('.tF2Cxc'):
    link = result.a['href']  # or result.select_one('.yuRUbf a')['href']
    print(link)
The output I get:
The problem is that only a few websites are returned, some URLs don't contain "gencat.cat" at all, and pages from the same site are repeated:
https://web.gencat.cat/ca/inici
https://web.gencat.cat/es/inici/
https://web.gencat.cat/ca/tramits
https://web.gencat.cat/en/inici/index.html
https://govern.cat/
https://govern.cat/salapremsa/
http://www.gencat.es/
http://www.regencos.cat/promocio-variable/preguntes-mes-frequents-sobre-el-coronavirus/
https://tauler.seu.cat/inici.do?idens=1
The output I want:
https://web.gencat.cat
http://agricultura.gencat.cat
http://cultura.gencat.cat
https://dretssocials.gencat.cat
http://economia.gencat.cat
CodePudding user response:
If you want just the scheme and host (the site root, not strictly the top-level domain), you can split the link variable on every "/":
for result in soup.select('.tF2Cxc'):
    link = result.a['href']
    print(link)
    string_splt = link.split("/")
    # piece 0 is the scheme ("http:" or "https:"), piece 2 is the host;
    # reusing the original scheme keeps http:// links as http://
    site_root = f"{string_splt[0]}//{string_splt[2]}"
    print(site_root)
I'm sure there is a cleaner way to put it all back together, but this seems to work. You will also need to handle the duplicates.
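Putting both fixes together, here is a minimal sketch; it assumes Google's markup still uses the .tF2Cxc result class (this changes over time) and that one request is enough, even though a single results page only carries about ten links. The site: operator asks Google itself to return only pages on that domain, urllib.parse.urlsplit keeps each link's original scheme, and a set drops duplicate hosts:
from urllib.parse import urlsplit

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
# "site:gencat.cat" restricts results to that domain on Google's side
params = {'q': 'site:gencat.cat'}

html = requests.get('https://www.google.com/search', headers=headers, params=params, timeout=10).text
soup = BeautifulSoup(html, 'lxml')

seen = set()  # site roots already printed
for result in soup.select('.tF2Cxc'):
    link = result.a['href']
    parts = urlsplit(link)  # scheme, netloc, path, query, fragment
    if not parts.netloc.endswith('gencat.cat'):
        continue  # extra guard against stray domains
    root = f"{parts.scheme}://{parts.netloc}"
    if root not in seen:
        seen.add(root)
        print(root)
To see more than the first handful of subdomains you would still need to paginate, e.g. by adding Google's start parameter to the query ({'q': 'site:gencat.cat', 'start': 10} for the second page).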