Why does using Amazon API gateway give the wrong HTML page when using requests.get(URL)

Time:04-03

I'm currently building a web scraper and have run into the issue of being IP blocked. To get around this, I'm trying to use requests_ip_rotator, which uses AWS API Gateway's large IP pool as a proxy to generate pseudo-infinite IPs for web scraping. Following this answer, I've implemented it in my code, which is below:

import requests
from bs4 import BeautifulSoup
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS

url = "https://secure.runescape.com/m=hiscore_oldschool_ironman/a=13/group-ironman/?groupSize=5&page=1"
page1 = requests.get(url)
soup1 = BeautifulSoup(page1.content, "html.parser")

gateway = ApiGateway("https://secure.runescape.com/",access_key_id="****",access_key_secret="****")
gateway.start()
session = requests.Session()
session.mount("https://secure.runescape.com/", gateway)
page2 = session.get(url)
gateway.shutdown() 
soup2 = BeautifulSoup(page2.content, "html.parser")

print("\n" + page1.url)
print(page2.url)
print(soup1.head.title==soup2.head.title)
input()

output:

Starting API gateways in 10 regions.
Using 10 endpoints with name 'https://secure.runescape.com/ - IP Rotate API' (10 new).
Deleting gateways for site 'https://secure.runescape.com'.
Deleted 10 endpoints with for site 'https://secure.runescape.com'.

https://secure.runescape.com/m=hiscore_oldschool_ironman/a=13/group-ironman/?groupSize=5&page=1
https://6kesqk9t6d.execute-api.eu-central-1.amazonaws.com/ProxyStage/m=hiscore_oldschool_ironman/a=13/overall
False

So both times I call .get(url) I am using the same URL, but I receive different pages. requests.get(url) gives me the page I want, but session.get(url) through the Amazon gateway gives me a different page from the same site (its final URL ends in "/overall" instead of "/group-ironman/?groupSize=5&page=1"). I'm stumped as to what the issue could be, so any help would be greatly appreciated!

Answer:

When making GET requests to the "https://secure.runescape.com" domain through the AWS gateway, I noticed that if the URL path ends in "a=13/group-ironman/?groupSize=5&page=x", for any x, I get a 302 (redirect) response that sends me to the URL path "/a=13/overall". This leads me to believe that the RuneScape server is redirecting AWS IPs for some URLs; fortunately, it's not redirecting my own IP.
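
One way to confirm a redirect like this is to look at the history that requests attaches to the final response. The sketch below is only an illustration: redirect_chain is a hypothetical helper name, and it assumes the response object comes from requests (or anything else exposing .history, .status_code, .url and .headers in the same way).

```python
def redirect_chain(response):
    """List every redirect hop that led to the final response.

    Each entry is (status_code, requested_url, Location header).
    `response.history` is the list of intermediate responses that
    requests followed before returning the final one.
    """
    return [
        (hop.status_code, hop.url, hop.headers.get("Location"))
        for hop in response.history
    ]
```

Calling redirect_chain(page2) on the gateway response from the question should show the 302 hop if the server is indeed redirecting, while redirect_chain(page1) should return an empty list when no redirect happened.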

So my workaround is to use requests.get() without the AWS gateway for the URLs that get redirected. Other URLs on the same site are not redirected when they go through the AWS gateway, so I still use it for those to avoid being IP blocked.
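
A minimal sketch of how this workaround could be automated, rather than hard-coding which URLs to send directly. fetch_with_fallback is a hypothetical helper, not part of requests or requests_ip_rotator; it assumes both arguments expose a requests-style .get() method, and that the gateway surfaces the redirect as a 3xx status when redirects are not followed.

```python
def fetch_with_fallback(url, gateway_session, direct_session):
    """Try the gateway first; fall back to a direct request on a redirect.

    Returns (response, used_gateway).
    """
    # Do not follow redirects, so a 302 from the server stays visible.
    response = gateway_session.get(url, allow_redirects=False)
    if 300 <= response.status_code < 400:
        # The server redirected the gateway's IP: retry from our own IP.
        return direct_session.get(url), False
    return response, True
```

With the names from the question, this could be called as fetch_with_fallback(url, session, requests) before gateway.shutdown(); the requests module itself can serve as direct_session because it provides a .get() function. Note that the fallback requests still come from your own IP, so they remain subject to IP blocking.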
