Home > Mobile >  Requests Module not fetching full website in Python
Requests Module not fetching full website in Python

Time:06-27

Sorry for a Noob question.... I have written a code which searches google for an image stored locally on my computer. I accomplished this using the requests module. I want to scrape the result page for information about the image but request module never fetches the entire page. It only fetches a part of it and thus I am not able to scrape the website for results

import requests
import webbrowser
from bs4 import BeautifulSoup

filePath = "C:\\Users\\mjjha\\Documents\\Checkrow\\monaLisa.jpg"
searchUrl = 'http://www.google.com/searchbyimage/upload'
multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': ''}
response = requests.post(searchUrl, files=multipart, allow_redirects=False)
fetchUrl = response.headers['Location']

r=requests.get(fetchUrl)
webbrowser.open(fetchUrl)
soup=BeautifulSoup(r.content,'html.parser')
head=soup.find_all('a')
for i in head:
    print(i['href'])

The web page looks like this: enter image description here

but when I scrape it for anchor tag links using beautiful soup I get the following result:

http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.hr/webhp?tbs=sbi:AMhZZisFTqPOZmEYpGB89rRLg4R4TfmF3WVQ_1gHEFiENQ8wbYqQq7-KsJUE5KuuxvINd0hFo10EMmP4RzWvOBvRxmsHZ7vm6etW-I36-QfCwmwir1NawORzsWZJffCnSwTpdts39mmQ1EfkcH0R8eGsiJ4_1Xw9DA_1C9mqLpChwRYdgOT-oFNcpt2O25Zhmo6ouG2XA5ZelCbKAChT4DJfGz0TXphXB_1dGEluDV6R_15n42URKCX5Q1zIqR6_16CR0rgXBphz95FMrETLqtPURRbAaWzauYisnSk6jF_1T5GbuoJKHtqThXevhogUSW9ERfZr5vbbWI6DA9c&ec=GAZAAQ
/search?ie=UTF-8&q=Anne Frank&oi=ddle&ct=236393864&hl=en-GB&si=AC1wQDDagiMg03ncxeOQZbwVe-CJxRCchC-jr2hCPTxjc9wbgOxFdg4PkIAWeA8WhyCLGGzkibRoi5B84SONt2NaUNMtZff0HVDXAtNUKeMfxbgImvSIzyY=&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQPQgD
/advanced_search?hl=en-IN&authuser=0
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=hr&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAU    
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=hi&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAY    
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=bn&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAc    
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=te&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAg    
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=mr&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAk    
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=ta&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAo
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=gu&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAs    
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=kn&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAw    
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=ml&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCA0
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=pa&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCA4    
/intl/en/ads/
http://www.google.co.in/services/
/intl/en/about.html
http://www.google.hr/setprefdomain?prefdom=US&sig=K_fWA_BOpuaXy87gYOc9cg4jE6KwU=
/intl/en/policies/privacy/
/intl/en/policies/terms/
PS C:\Users\mjjha\Documents\Checkrow> python -u "c:\Users\mjjha\Documents\Checkrow\1.py"
http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.hr/webhp?tbs=sbi:AMhZZiuVpzNQ_10ZbUunarcxtcBADLP2GPHTlIiPAtNpsOezRg48a1S4ofT8Df9C4NRT_1PzMk4baDStlOQGRp2okPHCALA5TvLodRqGj_1Q9_19KWykusySDjQNkbi67Ob6Kx7LZ0ybQ59c9mvyda27CBq8_19XutgXXgl4hGCLdXX9M3Od0WI9BckSHxv_1zajCMhKj1XaLKl9p7T9S0hfQbyZs4zQNcudXEk_1y3Zle6anU1rmSpEdpeCXC6r_1HnTnTLAtYWQlVFVF6QuCT8W5djGXGXwTjQH7NgkXnOi6q7v4F_1DqTVytnSAcBX6rc1eJFlXwIR2dzR73cs983mzb686VgOqUUNS1IG8w&ec=GAZAAQ
/search?ie=UTF-8&q=Anne Frank&oi=ddle&ct=236393864&hl=en-GB&si=AC1wQDDagiMg03ncxeOQZbwVe-CJxRCchC-jr2hCPTxjc9wbgOxFdg4PkIAWeA8WhyCLGGzkibRoi5B84SONt2NaUNMtZff0HVDXAtNUKeMfxbgImvSIzyY=&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQPQgD
/advanced_search?hl=en-IN&authuser=0
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=hr&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAU    
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=hi&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAY    
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=bn&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAc    
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=te&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAg    
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=mr&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAk    
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=ta&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAo    
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=gu&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAs    
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=kn&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAw    
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=ml&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCA0    
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=pa&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCA4    
/intl/en/ads/
http://www.google.co.in/services/
/intl/en/about.html
http://www.google.hr/setprefdomain?prefdom=US&sig=K_d19wKMnK5qQH_fmlL2YBuhhR_BE=
/intl/en/policies/privacy/
/intl/en/policies/terms/
PS C:\Users\mjjha\Documents\Checkrow> python -u "c:\Users\mjjha\Documents\Checkrow\1.py"
http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.hr/webhp?tbs=sbi:AMhZZiu1U_1mLzDh7oSZSFpNdcot7N84lXExmiJp6LMIJ1NO2PHle7mcBr72CTgX45DTbkkF8yfhvT0GATXTIzgd--ayauOaI-gLvTa-DAeOAodk35Kz6mpzCzl8ly6YYdUbn2S5cCe35BP37ysxT-tSFbvtLPwovJuiNPmunzpk_1j0a88zkXOmb1tn3vfgXnb6IhaucZJIMZztSDOIljOiaSTIzhdQ1aSusETDAu3EMNnoWRaFWqzcUGHzIWuABI9gJkelzjDaV-aK4ilxQJhwGJnzuKNHDbJ4GSX33an2jIfssmwfoWZLFej_1V0Zijr2fuFqULhoAg2lku41nHNxHY1nI0gNU4M2Q&ec=GAZAAQ
/search?ie=UTF-8&q=Anne Frank&oi=ddle&ct=236393864&hl=en-GB&si=AC1wQDDagiMg03ncxeOQZbwVe-CJxRCchC-jr2hCPTxjc9wbgOxFdg4PkIAWeA8WhyCLGGzkibRoi5B84SONt2NaUNMtZff0HVDXAtNUKeMfxbgImvSIzyY=&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQPQgD
/advanced_search?hl=en-IN&authuser=0
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=hr&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAU    
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=hi&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAY    
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=bn&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAc    
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=te&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAg    
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=mr&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAk    
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=ta&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAo    
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=gu&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAs    
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=kn&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAw
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=ml&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCA0    
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=pa&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCA4    
/intl/en/ads/
http://www.google.co.in/services/
/intl/en/about.html
http://www.google.hr/setprefdomain?prefdom=US&sig=K_5Sxk31MIG7AiUTjwI71yoWFyO_E=
/intl/en/policies/privacy/
/intl/en/policies/terms/

The content fetched by requests module doesn't contain the full web page I don't know why. I want to scrape information in image ,anchor and h3 tags from the page using beautiful soup but its just not working out.

CodePudding user response:

The main problem is Python Requests module doesn't render JavaScript. As a result, you are not getting the webpage you are supposed to get.

You are using a webbrowser module to view your URL where JavaScript is enabled, so you are getting the page as expected. But next, when you use the requests module to get the page, javascript stays disabled, and google doesn't let you render the page but instead redirects you to another page(Google Homepage). And there, you get different HTML resulting in no search results(you did in the first place).

Showing that final request get redirected

IN 1 is the URL you are trying to hit, and 2 is the URL you are redirected to.
Look at the difference is google.com/webhp?tbs=sbi:AMhZZisX... VS google.com/search?tbs=sbi:AMhZZisX...

The HTML of that page results in is this -

Final OutPut

Always use the source HTML given by the requests module, which shows you the actual result. As you can see, this is not the search result page.

So to reach your goal, try using Selenium.

  • Related