Sorry for a Noob question.... I have written a code which searches google for an image stored locally on my computer. I accomplished this using the requests module. I want to scrape the result page for information about the image but request module never fetches the entire page. It only fetches a part of it and thus I am not able to scrape the website for results
import requests
import webbrowser
from bs4 import BeautifulSoup
filePath = "C:\\Users\\mjjha\\Documents\\Checkrow\\monaLisa.jpg"
searchUrl = 'http://www.google.com/searchbyimage/upload'
multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': ''}
response = requests.post(searchUrl, files=multipart, allow_redirects=False)
fetchUrl = response.headers['Location']
r=requests.get(fetchUrl)
webbrowser.open(fetchUrl)
soup=BeautifulSoup(r.content,'html.parser')
head=soup.find_all('a')
for i in head:
print(i['href'])
but when I scrape it for anchor tag links using beautiful soup I get the following result:
http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.hr/webhp?tbs=sbi:AMhZZisFTqPOZmEYpGB89rRLg4R4TfmF3WVQ_1gHEFiENQ8wbYqQq7-KsJUE5KuuxvINd0hFo10EMmP4RzWvOBvRxmsHZ7vm6etW-I36-QfCwmwir1NawORzsWZJffCnSwTpdts39mmQ1EfkcH0R8eGsiJ4_1Xw9DA_1C9mqLpChwRYdgOT-oFNcpt2O25Zhmo6ouG2XA5ZelCbKAChT4DJfGz0TXphXB_1dGEluDV6R_15n42URKCX5Q1zIqR6_16CR0rgXBphz95FMrETLqtPURRbAaWzauYisnSk6jF_1T5GbuoJKHtqThXevhogUSW9ERfZr5vbbWI6DA9c&ec=GAZAAQ
/search?ie=UTF-8&q=Anne Frank&oi=ddle&ct=236393864&hl=en-GB&si=AC1wQDDagiMg03ncxeOQZbwVe-CJxRCchC-jr2hCPTxjc9wbgOxFdg4PkIAWeA8WhyCLGGzkibRoi5B84SONt2NaUNMtZff0HVDXAtNUKeMfxbgImvSIzyY=&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQPQgD
/advanced_search?hl=en-IN&authuser=0
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=hr&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAU
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=hi&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAY
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=bn&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAc
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=te&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAg
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=mr&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAk
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=ta&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAo
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=gu&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAs
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=kn&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAw
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=ml&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCA0
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0=&hl=pa&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCA4
/intl/en/ads/
http://www.google.co.in/services/
/intl/en/about.html
http://www.google.hr/setprefdomain?prefdom=US&sig=K_fWA_BOpuaXy87gYOc9cg4jE6KwU=
/intl/en/policies/privacy/
/intl/en/policies/terms/
PS C:\Users\mjjha\Documents\Checkrow> python -u "c:\Users\mjjha\Documents\Checkrow\1.py"
http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.hr/webhp?tbs=sbi:AMhZZiuVpzNQ_10ZbUunarcxtcBADLP2GPHTlIiPAtNpsOezRg48a1S4ofT8Df9C4NRT_1PzMk4baDStlOQGRp2okPHCALA5TvLodRqGj_1Q9_19KWykusySDjQNkbi67Ob6Kx7LZ0ybQ59c9mvyda27CBq8_19XutgXXgl4hGCLdXX9M3Od0WI9BckSHxv_1zajCMhKj1XaLKl9p7T9S0hfQbyZs4zQNcudXEk_1y3Zle6anU1rmSpEdpeCXC6r_1HnTnTLAtYWQlVFVF6QuCT8W5djGXGXwTjQH7NgkXnOi6q7v4F_1DqTVytnSAcBX6rc1eJFlXwIR2dzR73cs983mzb686VgOqUUNS1IG8w&ec=GAZAAQ
/search?ie=UTF-8&q=Anne Frank&oi=ddle&ct=236393864&hl=en-GB&si=AC1wQDDagiMg03ncxeOQZbwVe-CJxRCchC-jr2hCPTxjc9wbgOxFdg4PkIAWeA8WhyCLGGzkibRoi5B84SONt2NaUNMtZff0HVDXAtNUKeMfxbgImvSIzyY=&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQPQgD
/advanced_search?hl=en-IN&authuser=0
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=hr&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAU
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=hi&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAY
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=bn&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAc
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=te&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAg
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=mr&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAk
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=ta&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAo
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=gu&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAs
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=kn&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAw
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=ml&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCA0
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ=&hl=pa&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCA4
/intl/en/ads/
http://www.google.co.in/services/
/intl/en/about.html
http://www.google.hr/setprefdomain?prefdom=US&sig=K_d19wKMnK5qQH_fmlL2YBuhhR_BE=
/intl/en/policies/privacy/
/intl/en/policies/terms/
PS C:\Users\mjjha\Documents\Checkrow> python -u "c:\Users\mjjha\Documents\Checkrow\1.py"
http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.hr/webhp?tbs=sbi:AMhZZiu1U_1mLzDh7oSZSFpNdcot7N84lXExmiJp6LMIJ1NO2PHle7mcBr72CTgX45DTbkkF8yfhvT0GATXTIzgd--ayauOaI-gLvTa-DAeOAodk35Kz6mpzCzl8ly6YYdUbn2S5cCe35BP37ysxT-tSFbvtLPwovJuiNPmunzpk_1j0a88zkXOmb1tn3vfgXnb6IhaucZJIMZztSDOIljOiaSTIzhdQ1aSusETDAu3EMNnoWRaFWqzcUGHzIWuABI9gJkelzjDaV-aK4ilxQJhwGJnzuKNHDbJ4GSX33an2jIfssmwfoWZLFej_1V0Zijr2fuFqULhoAg2lku41nHNxHY1nI0gNU4M2Q&ec=GAZAAQ
/search?ie=UTF-8&q=Anne Frank&oi=ddle&ct=236393864&hl=en-GB&si=AC1wQDDagiMg03ncxeOQZbwVe-CJxRCchC-jr2hCPTxjc9wbgOxFdg4PkIAWeA8WhyCLGGzkibRoi5B84SONt2NaUNMtZff0HVDXAtNUKeMfxbgImvSIzyY=&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQPQgD
/advanced_search?hl=en-IN&authuser=0
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=hr&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAU
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=hi&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAY
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=bn&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAc
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=te&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAg
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=mr&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAk
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=ta&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAo
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=gu&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAs
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=kn&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAw
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=ml&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCA0
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM=&hl=pa&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCA4
/intl/en/ads/
http://www.google.co.in/services/
/intl/en/about.html
http://www.google.hr/setprefdomain?prefdom=US&sig=K_5Sxk31MIG7AiUTjwI71yoWFyO_E=
/intl/en/policies/privacy/
/intl/en/policies/terms/
The content fetched by requests module doesn't contain the full web page I don't know why. I want to scrape information in image ,anchor and h3 tags from the page using beautiful soup but its just not working out.
CodePudding user response:
The main problem is Python Requests module doesn't render JavaScript. As a result, you are not getting the webpage you are supposed to get.
You are using a webbrowser module to view your URL where JavaScript is enabled, so you are getting the page as expected. But next, when you use the requests module to get the page, javascript stays disabled, and google doesn't let you render the page but instead redirects you to another page(Google Homepage). And there, you get different HTML resulting in no search results(you did in the first place).
IN 1 is the URL you are trying to hit, and 2 is the URL you are redirected to.
Look at the difference is google.com/webhp?tbs=sbi:AMhZZisX...
VS google.com/search?tbs=sbi:AMhZZisX...
The HTML of that page results in is this -
Always use the source HTML given by the requests module, which shows you the actual result. As you can see, this is not the search result page.
So to reach your goal, try using Selenium.