I prepared three small scripts that should, in theory, do the same thing, but two of them do not work properly, and I am not sure what is wrong. I am using PyCharm, and the packages were installed inside the project, not globally with pip.
The first script gives me no results, just "Process finished with exit code 0".
import requests
import bs4
text = "Python"
url = 'https://google.com/search?q=' + text
request_result = requests.get(url)
soup = bs4.BeautifulSoup(request_result.text, "html.parser")
heading_object = soup.find_all('h3')
for info in heading_object:
    print(info.getText())
The second script behaves the same way: it prints only "Process finished with exit code 0".
import requests
import bs4
from urllib.parse import quote_plus
result = 'Python'
query = quote_plus(result)
link = f"https://www.google.com/search?q={query}"
request_result = requests.get(link)
soup = bs4.BeautifulSoup(request_result.text, "html.parser")
for p in soup.find_all('h3'):
    print(p.text)
The third script works fine and prints results from the Google search.
import requests
import bs4
url = "https://www.google.com/search"
params = {"q": "Python"}
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
soup = bs4.BeautifulSoup(requests.get(url, params=params, headers=headers).content, "html.parser")
for a in soup.select("a:has(h3)"):
    print(a["href"])
Can someone please explain what is wrong with the scripts that did not work? I am asking because, in theory, they should work (they are based on a tutorial). Is there perhaps a better way to scrape Google results than the above?
CodePudding user response:
At the risk of stating the obvious, the main difference between your scripts is that the working one sends a browser's User-Agent header. For instance, here is your first script with headers:
import requests
import bs4
# Present the request as coming from a desktop browser
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
text = "Python"
url = 'https://google.com/search?q=' + text
request_result = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(request_result.text, "html.parser")
heading_object = soup.find_all('h3')
for info in heading_object:
    print(info.getText())
Results:
Welcome to Python.org
Downloads
Python For Beginners
[...]
Headers are how a client presents itself when knocking on the server's door: the server can choose to accept or deny the request.
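To see how the two kinds of request present themselves, one can build them without sending anything. The following is only a sketch, assuming the requests library is installed; it relies on the fact that a requests session identifies itself as python-requests/<version> unless a User-Agent is supplied. The variable names plain and browser_like are made up for this illustration.

```python
import requests

url = "https://www.google.com/search"
params = {"q": "Python"}
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}

session = requests.Session()

# Build (but do not send) the request roughly as the first two scripts issue it.
plain = session.prepare_request(requests.Request("GET", url, params=params))

# Build the same request with a browser-like User-Agent, as in the third script.
browser_like = session.prepare_request(
    requests.Request("GET", url, params=params, headers=browser_headers)
)

print(plain.url)                           # the URL is identical in both cases
print(plain.headers["User-Agent"])         # python-requests/<installed version>
print(browser_like.headers["User-Agent"])  # the Firefox string from above
```

The URL is the same either way, so the only visible difference between the requests is the User-Agent value, which is what the server can use to decide how to respond.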
CodePudding user response:
It won't work because heading_object is an empty list: no h3 elements are found in the response.
So I changed h3 to h2 and then to h1 to show that the code itself works:
heading_object = soup.find_all('h1')
This is the code:
import requests
import bs4
text = "Python"
url = 'https://google.com/search?q=' + text
request_result = requests.get(url)
soup = bs4.BeautifulSoup(request_result.text, "html.parser")
heading_object = soup.find_all('h1')
print(heading_object)
for info in heading_object:
    print(info.getText())
This is the result of running the code:
[<h1>Before you continue to Google</h1>]
Before you continue to Google
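The h1 text suggests that, for this request, Google returned a "Before you continue" page rather than a results page, which would explain the empty h3 list. A small helper can make that kind of failure visible instead of silently printing nothing. This is a minimal sketch: the function name result_titles and the two sample HTML strings are hypothetical, and the pages Google actually serves may look different.

```python
import bs4


def result_titles(html):
    """Return the text of every h3 in the page; raise if there are none."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    titles = [h3.get_text() for h3 in soup.find_all("h3")]
    if not titles:
        # No h3 at all: report the h1 headings to hint at which page came back.
        headings = [h1.get_text() for h1 in soup.find_all("h1")]
        raise ValueError(f"no h3 elements found; page headings: {headings}")
    return titles


# Hypothetical stand-ins for the two kinds of response seen in this thread.
results_page = "<html><body><h3>Welcome to Python.org</h3><h3>Downloads</h3></body></html>"
consent_page = "<html><body><h1>Before you continue to Google</h1></body></html>"

print(result_titles(results_page))
try:
    result_titles(consent_page)
except ValueError as error:
    print(error)
```

In the scripts above, the argument would be request_result.text.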