I'm trying to write logic so that the script prints results according to priorities I set. With my current attempt, the results are printed in the order they appear in the list instead.
To be more precise, I want to print the results as E, D, B, and A, regardless of how they appear in the list.
I've tried with:
items = ['A','B','D','E']
for item in items:
    if item=="E":
        print(item)
    elif item=="D":
        print(item)
    elif item=="B":
        print(item)
    elif item=="A":
        print(item)
Current output:
A
B
D
E
Expected output:
E
D
B
A
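For the demo alone, one possible approach is to drive the output from the priority order instead of from the list order. The sketch below shows two variants; the names `priority`, `rank`, `by_loop` and `by_sort` are introduced here for illustration and are not part of the original attempt.

```python
items = ['A', 'B', 'D', 'E']

# Hypothetical priority order: earlier entries are printed first.
priority = ['E', 'D', 'B', 'A']

# Variant 1: walk the priorities and keep the ones present in items.
present = set(items)
by_loop = [p for p in priority if p in present]

# Variant 2: sort items by their position in the priority list.
# Items missing from the priority list are placed at the end.
rank = {value: index for index, value in enumerate(priority)}
by_sort = sorted(items, key=lambda item: rank.get(item, len(rank)))

for item in by_loop:
    print(item)
```

Both variants produce the same order for this input; the sorting variant also keeps items that have no assigned priority.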
EDIT:
I created the demo above to get suggestions that I can apply to the script below. The script iterates through all the websites in a predefined list. From each site, I want to get the link associated with the contact button. If there is no contact button, I'm willing to fall back to the link connected to the about button.
Here is what I have tried:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

links = [
    "http://www.innovaprint.com.sg/",
    "https://www.richardsonproperties.com/",
    "https://www.thepunctuationguide.com/",
    "http://www.knowledgeplatform.com/",
    "http://www.singaporeenterpriseassociation.com/",
    "https://www2.deloitte.com/sg/en.html"
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0'
}

for link in links:
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"lxml")
    target_link = ''
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            target_link = urljoin(link,item['href'])
            break
        elif "about" in item.text.lower():
            target_link = urljoin(link,item['href'])
            break
    print(target_link)
The "about" link appears frequently when I run the script, even though the "contact" link is available.
CodePudding user response:
Is this what you want?
Change this line:
elif "about" in item.text.lower():
To this:
elif item.text.lower() in ["about", "about.html"]:
Then, running your code with the said change:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

links = [
    "http://www.innovaprint.com.sg/",
    "https://www.richardsonproperties.com/",
    "https://www.thepunctuationguide.com/",
    "http://www.knowledgeplatform.com/",
    "http://www.singaporeenterpriseassociation.com/",
    "https://www2.deloitte.com/sg/en.html"
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0'
}

for link in links:
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "lxml")
    target_link = ''
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            target_link = urljoin(link, item['href'])
            break
        elif item.text.lower() in ["about", "about.html"]:
            target_link = urljoin(link, item['href'])
            break
    print(target_link)
Produces:
http://www.innovaprint.com.sg/contact.html
https://www.richardsonproperties.com/contact-us/
mailto:[email protected]
http://www.knowledgeplatform.com/about/
http://www.singaporeenterpriseassociation.com/contact.html
https://www2.deloitte.com/sg/en/footerlinks/contact-us.html?icid=bn_contact-us
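Note that exact matching still depends on the order of the anchors: an anchor whose text is exactly "about" and that appears before the contact anchor would still be picked. A minimal sketch of a single-pass variant that remembers the first "about" link only as a fallback is shown here. The function `pick_link`, the `example.com` URL and the `(text, href)` pairs are hypothetical stand-ins for the anchors returned by `soup.select("a[href]")`.

```python
from urllib.parse import urljoin

def pick_link(base_url, anchors):
    """Return the first 'contact' link, else the first 'about' link, else ''."""
    fallback = ''
    for text, href in anchors:
        lowered = text.lower()
        if "contact" in lowered:
            # A contact link always wins, so stop immediately.
            return urljoin(base_url, href)
        if "about" in lowered and not fallback:
            # Remember the first about link, but keep looking for contact.
            fallback = urljoin(base_url, href)
    return fallback

# Hypothetical anchors in document order: "about" comes first.
anchors = [("About Us", "/about/"), ("Contact", "/contact.html")]
print(pick_link("http://example.com/", anchors))
```

In the script itself, the pairs could be built with `[(a.text, a['href']) for a in soup.select("a[href]")]`.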
CodePudding user response:
You can prioritize the contact link over the about link in the following way:
for link in links:
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"lxml")
    target_link = None
    # First pass: look for a contact link only.
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            target_link = urljoin(link,item['href'])
            break
    # Second pass: fall back to an about link if no contact link was found.
    if target_link is None:
        for item in soup.select("a[href]"):
            if "about" in item.text.lower():
                target_link = urljoin(link,item['href'])
                break
    print(target_link)
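If more than two keywords are ever needed, the same idea can be sketched with a ranked keyword list, which ties it back to the E, D, B, A demo from the question. This is only an illustration: `KEYWORDS`, `keyword_rank` and `best_link` are hypothetical names, and the anchors are passed as plain `(text, href)` pairs so the logic can be tried without fetching any page.

```python
from urllib.parse import urljoin

# Hypothetical priority order: earlier keywords win.
KEYWORDS = ["contact", "about"]

def keyword_rank(text, keywords=KEYWORDS):
    """Index of the first keyword found in text, or len(keywords) if none."""
    lowered = text.lower()
    for index, keyword in enumerate(keywords):
        if keyword in lowered:
            return index
    return len(keywords)

def best_link(base_url, anchors, keywords=KEYWORDS):
    """Pick the anchor with the best rank; ties go to the earliest anchor."""
    ranked = [(keyword_rank(text, keywords), href) for text, href in anchors]
    matches = [pair for pair in ranked if pair[0] < len(keywords)]
    if not matches:
        return None
    # min() returns the first minimal element, so document order breaks ties.
    _, href = min(matches, key=lambda pair: pair[0])
    return urljoin(base_url, href)
```

Inside the loop over `links`, the pairs could be built with `[(a.text, a['href']) for a in soup.select("a[href]")]` and passed to `best_link(link, pairs)`.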