Python web scraping with bs4. Some things don't work, probably my if statement. Code and description in

Time:09-10

To begin with, I want to say that I'm a complete beginner. I thought about what could be useful for me, and I created this thing. So, if you have the question "why do you use ... instead of ...", the answer is: I don't know. I probably saw it in the documentation, on Reddit, or on Stack. Those are my learning resources, so not every code snippet is written with perfect readability and good practice, because my limited experience still holds me back.

I created a little script based on BS4. It shows me Google results for a few Google dorks. The script works very well, but after many hours of testing, editing, and checking things, I still have a problem with a few things.

Google can give me one of three outcomes, which I must handle in my if statement. The outcomes are:

  1. Normal Google results with links based on my dork (correct links).
  2. The message "No search results were found for the term ...", but Google prints additional links, something like suggestions or similar results (garbage, treat it as no results found).
  3. The message "The given phrase ... has not been found." Google did not find any results, which is OK.

I thought about how to handle this, and I created an if statement. Unfortunately, it does not work well.

if soup.select('.rQUFld'):
    print("No results, suggest garbage links")
if soup.select("a:has(h3)"):
    print("Found the right links")
    for a in soup.select("a:has(h3)"):
        print(a["href"])
else:
    print("No results, zero links.")

It is important to know that ".rQUFld" is the name of the class that contains "No search results were found for the term ...". I think this is the main problem. When that class exists, Google shows additional results that I do not want. As a result, the second if is executed too, instead of only the first. So the output is the message from the first if, plus the links from the second if (garbage links).

I want to obtain something like this:

  1. If the class exists, print "No results" (garbage links),
  2. If only links exist, print those links (correct links),
  3. Else, print "No results" (no links and no class from the first if).

From my tests, the second and third options work fine (Google found only correct links, or did not find any links).

Here is the full code:

import requests
import bs4

url = "https://www.google.com/search"
params = {"q": 'here_is_some_google_dork'}  
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"}
soup = bs4.BeautifulSoup(requests.get(url, params=params, headers=headers).content, "html.parser")

if soup.select('.rQUFld'):
    print("No results, suggest garbage links")
if soup.select("a:has(h3)"):
    print("Found the right links")
    for a in soup.select("a:has(h3)"):
        print(a["href"])
else:
    print("No results, zero links.")

In addition, I will add something:

  1. If the class exists (my first option), I think I could check for something else too, instead of rQUFld only. These things could work (full name): id="topstuff" OR OR my proposition . The rQUFld class is inside the previous class, which is inside topstuff.

  2. If no links are found, which is my third option in the if statement (just else), Google shows this at: <div id="res" role="main">. But I just use "else" instead of checking the class name. I think that's correct.

Thanks in advance for any responses. If the post is chaotic, I'm really sorry. These are my first steps here, so give me some time. If some words are incomprehensible, sorry for that. English is not my main language, but I'm working on it.

CodePudding user response:

You have one elementary error, which is that the logic of your if statements doesn't match the logic in your head.

Based on your question, you want something like:

  • If the first condition is satisfied, do x
  • Otherwise, check a second condition - if that's satisfied, do y
  • Otherwise, if neither of those two conditions is satisfied, do z

Your logic links all three in a chain - but if you restart with a new if in the middle, the chain is broken.

In Python, this plain language logic matches with:

  • if
  • elif
  • else

elif is tested only if the first if condition fails. The final else branch has no condition of its own; it runs only if every condition before it fails.
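To see the broken chain concretely, here is a minimal sketch that compares the two structures on plain True/False flags. The function names two_ifs and chained and the flag names are made up for this illustration; has_banner stands for "the .rQUFld class was found" and has_links for "a:has(h3) matched something".

```python
def two_ifs(has_banner, has_links):
    """Structure from the question: a second, independent if."""
    fired = []
    if has_banner:
        fired.append("banner")
    if has_links:            # new if: tested even when the first one matched
        fired.append("links")
    else:                    # this else belongs to the second if only
        fired.append("nothing")
    return fired


def chained(has_banner, has_links):
    """Single if / elif / else chain: exactly one branch runs."""
    fired = []
    if has_banner:
        fired.append("banner")
    elif has_links:          # tested only when has_banner is false
        fired.append("links")
    else:
        fired.append("nothing")
    return fired


# The problematic case from the question: banner and links both present.
print(two_ifs(True, True))   # ['banner', 'links']
print(chained(True, True))   # ['banner']
```

Note that with two separate ifs, a page that has the banner but no links would also report both "banner" and "nothing", because the else is attached only to the second if.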

Now, if you want to add multiple conditions to your first if statement, you can combine them with the or and and operators. Python also has the symbols | and &, but those are bitwise operators: they happen to work on True/False values, yet they do not short-circuit and they bind more tightly than comparisons such as ==, so or and and are the safer choice in an if statement. To keep things readable you can wrap each condition in brackets, like this:

if ("something" == "something") or ("something" == "something else"):
    print("something")

Result:

something

Here's the same thing using the and operator:

if ("something" == "something") and ("something" == "something else"):
    print("something")
else:
    print("nothing")

Result:

nothing
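One caveat about the symbols | and &: they are not interchangeable with or and and. The following sketch shows the two practical differences, evaluation of the right-hand side and operator precedence. The helper noisy and all variable names are invented for this example.

```python
calls = []

def noisy():
    """Record that it was called, then return True."""
    calls.append("called")
    return True

# `or` short-circuits: the right side is skipped once the left is True.
result_or = True or noisy()
calls_after_or = len(calls)      # 0

# `|` always evaluates both sides.
result_pipe = True | noisy()
calls_after_pipe = len(calls)    # 1

# Precedence: `|` binds tighter than `==`, so without brackets the
# expression below is read as the chained comparison x == (1 | y) == 2.
x, y = 1, 2
with_or = x == 1 or y == 2               # True
with_pipe = x == 1 | y == 2              # False, because 1 | 2 is 3
with_brackets = (x == 1) | (y == 2)      # True

print(calls_after_or, calls_after_pipe)
print(with_or, with_pipe, with_brackets)
```

This is why brackets are required around each comparison when | or & is used, while or and and read correctly without them.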

And here's elif in action:

if ("something" == "something else"):
    print("something else")
elif ("something" == "something"):
    print("something")
else:
    print("nothing")

Result:

something
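Applied to the code in the question, the same chain could look roughly like the sketch below. The function name classify is hypothetical; it takes the two lists that the question already computes with soup.select('.rQUFld') and soup.select("a:has(h3)"), so the decision logic can be tried without making a request. The dictionaries in the demo are stand-ins for bs4 tags, which support a["href"] in the same way.

```python
def classify(banner_hits, link_hits):
    """Return (message, hrefs) for one results page.

    banner_hits: what soup.select(".rQUFld") returned
    link_hits:   what soup.select("a:has(h3)") returned
    """
    if banner_hits:
        # "No search results" banner present: ignore the suggested links.
        return "No results, suggest garbage links", []
    elif link_hits:
        # No banner, and real result links were found.
        return "Found the right links", [a["href"] for a in link_hits]
    else:
        # Neither the banner nor any links.
        return "No results, zero links.", []


# Demo with stand-in data (assumption: not real Google output).
fake_links = [{"href": "https://example.com/a"}]
message, hrefs = classify(["banner element"], fake_links)
print(message)
for href in hrefs:
    print(href)
```

In the original script the call would be classify(soup.select('.rQUFld'), soup.select("a:has(h3)")), followed by printing the message and each link.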