I'm trying to use the bs4 (Beautiful Soup 4) and requests libraries in Python to scrape key data from websites for a work-based project that I've been assigned. My web scraping program mostly works, but I'm running into issues on certain websites such as Google.
The problem is that, on certain websites, the HTML scraped by my program does not match the HTML displayed in the Elements panel of the developer tools in many browsers.
Consider the following "visual testing" extract from my program:
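from bs4 import BeautifulSoup
import requests

# Google search URL for the query "stack overflow"
URL_source = r'https://google.com/search?q=stack overflow'
# Fetch the page and keep the response body as text
response_object = requests.get(URL_source).text
soup = BeautifulSoup(response_object, 'lxml')
# Pretty-print the parsed tree and split it into lines
soup = str(soup.prettify())
soup = soup.split('\n')
# Print the first 20 lines
for i in range(20):
    print(soup[i])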
This program is supposed to scrape and print out the first 20 lines of the HTML underpinning the Google page that results from searching "stack overflow" in Google (https://google.com/search?q=stack overflow). The output that I get using the Spyder IDE is as follows:
<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<style nonce="RJFpNnOeRBbdIaYpv jsHw">
a, a:link, a:visited, a:active, a:hover {
color: #1a73e8;
text-decoration: none;
}
body {
font-family: Roboto,RobotoDraft,Helvetica,Arial,sans-serif;
text-align: center;
-ms-text-size-adjust: 100%;
-moz-text-size-adjust: 100%;
-webkit-text-size-adjust: 100%;
}
.box {
border: 1px solid #dadce0;
box-sizing: border-box;
border-radius: 8px;
margin: 24px auto 5px auto;
However, this HTML appears to be very different from what the Elements panel in Chrome's (and Microsoft Edge's) developer tools (keyboard shortcut: F12) shows as the underlying HTML:
[Screenshot: results from Chrome's developer tools Elements panel]
Why is this discrepancy arising?
Any help you can offer will be greatly appreciated. I'm sure many of you will be able to see the source of the problem straight away. I suspect that the fact that the HTML extract I've scraped is in a (mostly) JSON format might hold some clues.
Thank you.
CodePudding user response:
Try a different link. Google is a search engine, and the source it returns changes each time a query is searched. Try a simple page such as https://hacknetayush.repl.co
CodePudding user response:
Note that developer tools operate on the live browser DOM. What you see when inspecting the page in the Elements panel is not the original HTML, but a version modified by browser clean-up and by executing JavaScript.
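To see the difference without touching the network, here is a minimal sketch. The page in `RAW_HTML` is made up for illustration (it is not Google's markup): the server sends an empty container plus a script that a browser would run to fill it in. Beautiful Soup only parses the text it is given, so the heading the script would create never shows up in the soup. The sketch uses the built-in `html.parser`; the `lxml` parser from the question should behave the same way here.

```python
from bs4 import BeautifulSoup

# Hypothetical page for illustration only: an empty container and a
# script that a browser would execute to insert a heading into it.
RAW_HTML = """
<html>
  <body>
    <div id="results"></div>
    <script>
      document.getElementById("results").innerHTML = "<h3>Stack Overflow</h3>";
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# The parser treats the script body as inert text, so the container
# is still empty and no <h3> element exists in the parsed tree.
container = soup.find(id="results")
print(repr(container.get_text(strip=True)))
print(len(soup.find_all("h3")))

# The heading text is only present inside the unexecuted script.
print("Stack Overflow" in soup.find("script").string)
```

In a browser, the Elements panel for the same page would show an h3 inside the div, because the script has already run by the time you inspect it.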
Requests does not execute JavaScript, so the content can deviate slightly from what the browser shows, but you can still scrape it. Just take a deeper look into your soup.
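One way to take that deeper look is to summarize what actually came back instead of printing the first 20 lines, which are often just inline CSS. The sketch below is only an illustration: `summarize` is a hypothetical helper, and `SAMPLE` is invented markup standing in for a response body (an assumption, not what Google really returns). A title or form that you do not see in the browser can hint that the server sent a different page, for example some kind of interstitial.

```python
from bs4 import BeautifulSoup


def summarize(html):
    """Return a few coarse facts about an HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    return {
        "title": title,
        # Where any forms on the page would submit to
        "form_actions": [form.get("action") for form in soup.find_all("form")],
        # Number of anchors that actually carry an href
        "link_count": len(soup.find_all("a", href=True)),
        # Scripts that a browser would run but requests will not
        "script_count": len(soup.find_all("script")),
    }


# Invented stand-in for a response body; not real Google markup.
SAMPLE = """
<html>
  <head><title>Before you continue</title></head>
  <body>
    <form action="/consent/save"><button>Accept</button></form>
    <a href="/policy">Policy</a>
    <script>/* would build the page in a browser */</script>
  </body>
</html>
"""

summary = summarize(SAMPLE)
print(summary)
```

With the code from the question, this would be called as `summarize(response_object)`, since `response_object` already holds the response text.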