Failed to print the result according to the priorities I set

Time:09-30

I'm trying to write some logic so the script prints the results according to the priorities I set. Contrary to what I intended, the results are printed in the order in which they appear in the list.

To be more precise, I want to print the results as E, D, B, and A, regardless of how they appear in the list.

I've tried with:

items = ['A','B','D','E']

for item in items:
    if item=="E":
        print(item)
    elif item=="D":
        print(item)
    elif item=="B":
        print(item)
    elif item=="A":
        print(item)

Current output:

A
B
D
E

Expected output:

E
D
B
A
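For the demo list, one way to get this order is to sort the items by their position in an explicit priority list. This is a minimal sketch; the names `priority`, `rank`, and `ordered` are illustrative and not part of the original script, and items missing from the priority list are assumed to sort last:

```python
items = ['A', 'B', 'D', 'E']

# Desired print order, highest priority first (illustrative name).
priority = ['E', 'D', 'B', 'A']

# Map each value to its rank so the sort key is a dictionary lookup.
rank = {value: position for position, value in enumerate(priority)}

# Items missing from the priority list sort last instead of raising KeyError.
ordered = sorted(items, key=lambda value: rank.get(value, len(priority)))

for item in ordered:
    print(item)
```

The `if`/`elif` chain in the loop cannot reorder anything on its own, because the loop still visits the items in list order; the ordering has to come from the sort key.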

EDIT:

I created the demo above to get suggestions that I can apply to the script below. The script iterates through all the websites in a predefined list. From each site, I want to get the link associated with the contact button. If there is no contact button, I'm willing to fall back to the link connected to the about button.

Here is what I have tried:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

links = [
    "http://www.innovaprint.com.sg/",
    "https://www.richardsonproperties.com/",
    "https://www.thepunctuationguide.com/",
    "http://www.knowledgeplatform.com/",
    "http://www.singaporeenterpriseassociation.com/",
    "https://www2.deloitte.com/sg/en.html"
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0'
}

for link in links:
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"lxml")
    target_link = ''
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            target_link = urljoin(link,item['href'])
            break

        elif "about" in item.text.lower():
            target_link = urljoin(link,item['href'])
            break

    print(target_link)

The "about" link appears frequently when I run the script, even though the "contact" link is available.

CodePudding user response:

Is this what you want?

Change this line:

        elif "about" in item.text.lower():

To this:

        elif item.text.lower() in ["about", "about.html"]:

Then, running your code with the said change:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

links = [
    "http://www.innovaprint.com.sg/",
    "https://www.richardsonproperties.com/",
    "https://www.thepunctuationguide.com/",
    "http://www.knowledgeplatform.com/",
    "http://www.singaporeenterpriseassociation.com/",
    "https://www2.deloitte.com/sg/en.html"
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0'
}

for link in links:
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "lxml")
    target_link = ''
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            target_link = urljoin(link, item['href'])
            break

        elif item.text.lower() in ["about", "about.html"]:
            target_link = urljoin(link, item['href'])
            break
    print(target_link)

Produces:

http://www.innovaprint.com.sg/contact.html
https://www.richardsonproperties.com/contact-us/
mailto:[email protected]
http://www.knowledgeplatform.com/about/
http://www.singaporeenterpriseassociation.com/contact.html
https://www2.deloitte.com/sg/en/footerlinks/contact-us.html?icid=bn_contact-us
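The change has an effect because the two conditions test different things: `"about" in text` is a substring test, while `text in [...]` checks whether the whole text equals one of the list entries. A small illustration, using a made-up link text ("About Us") rather than one taken from the sites above:

```python
text = "About Us".lower()

# Substring test: true whenever "about" occurs anywhere in the text.
substring_match = "about" in text

# List membership: true only if the whole text equals one of the entries.
exact_match = text in ["about", "about.html"]

print(substring_match, exact_match)  # True False
```

With the exact comparison, links such as "About Us" no longer stop the loop early, so a later "contact" link can still be reached. Link text with surrounding whitespace would also fail the exact comparison, so calling `.strip()` on the text first may be worth considering.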

CodePudding user response:

You can prioritize the contact link over the about link in the following way:

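# Imports, links, and headers are the same as in the question.
for link in links:
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "lxml")
    target_link = None

    # First pass: look only for a contact link.
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            target_link = urljoin(link, item['href'])
            break

    # Second pass: fall back to an about link only if no contact link was found.
    if target_link is None:
        for item in soup.select("a[href]"):
            if "about" in item.text.lower():
                target_link = urljoin(link, item['href'])
                break

    print(target_link)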
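The same two-pass idea can be generalized to any number of keywords by looping over them in priority order. The sketch below is only an illustration: `pick_link` is a hypothetical helper, the anchors are made-up `(text, href)` pairs, and `example.com` stands in for a real site so the code runs without network access:

```python
from urllib.parse import urljoin


def pick_link(base_url, anchors, keywords=("contact", "about")):
    """Return the first link matching the highest-priority keyword, or None.

    anchors is a list of (text, href) pairs in page order.
    """
    for keyword in keywords:            # highest priority first
        for text, href in anchors:      # page order within one keyword
            if keyword in text.lower():
                return urljoin(base_url, href)
    return None


# Hypothetical anchors, with "About Us" appearing before "Contact" on the page.
anchors = [("Home", "/"), ("About Us", "/about/"), ("Contact", "/contact.html")]
print(pick_link("http://example.com/", anchors))
```

In the scraping loop, the pairs could be built from the parsed page with `[(a.text, a['href']) for a in soup.select("a[href]")]`.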