How to scrape every page without intermittently being detected as a bot, using Python

Time:10-01

I am trying to scrape restaurant pages from URLs stored in a database. The host is https://www.just-eat.co.{tenant}. From each response I extract the window.__INITIAL_STATE__ variable, which contains the JSON I need.

import json
import re

import requests

# restos is the list of restaurant rows loaded from the database
for resto in restos:
    host = resto['menu_url'].replace('https://', '').split('/')[0]
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Content-Type': 'application/json',
        'Host': host,
        'sec-ch-ua': "\"Google Chrome\";v=\"93\", \" Not;A Brand\";v=\"99\", \"Chromium\";v=\"93\"",
        'sec-ch-ua-mobile': '?0',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
    }

    response = requests.get(url=resto['menu_url'], headers=headers)
    # Capture everything between "window.__INITIAL_STATE__=" and the last "<" on that line
    data = re.search(r'(?<=window\.__INITIAL_STATE__=)(.*)(?=<)', response.text).group(1)
    data = json.loads(data)

Here is the problem: when I scrape a set of restaurants, I get the full HTML of the page for roughly the first 5 restaurants, then I suddenly get the HTML below instead, then the full HTML again, and so on.

<html>
    <head>
        <META NAME="robots" CONTENT="noindex,nofollow">
        <script src="/_Incapsula_Resource?SWJIYLWA=5074a7">
        </script>
    <body>
    </body>
</html>

This HTML causes an error because I access the JSON with fixed keys. A try-except is not a solution, because the restaurant URL opens fine in a browser; the only case I want to skip is a page that cannot be found. I never want to receive the HTML above, only the full HTML of the page, which contains window.__INITIAL_STATE__.

<script>window.__INITIAL_STATE__={...
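To make the difference between the two responses explicit, here is a minimal sketch that tells the full page apart from the blocked one and pulls out the JSON. The helper names `is_challenge_page` and `extract_initial_state` are made up for illustration, and it assumes the state is a single JSON value written directly after `window.__INITIAL_STATE__=`. It uses `json.JSONDecoder.raw_decode`, which parses one JSON value and stops, instead of a greedy regex.

```python
import json

MARKER = "window.__INITIAL_STATE__="
# Substring taken from the blocked response shown in the question
CHALLENGE_HINT = "_Incapsula_Resource"


def is_challenge_page(html):
    """True if the response looks like the short blocked page, not the full page."""
    return MARKER not in html and CHALLENGE_HINT in html


def extract_initial_state(html):
    """Return the parsed state as a Python object, or None if it is absent or invalid."""
    start = html.find(MARKER)
    if start == -1:
        return None
    start += len(MARKER)
    # raw_decode does not skip leading whitespace, so do it by hand
    while start < len(html) and html[start].isspace():
        start += 1
    try:
        # Parses exactly one JSON value starting at `start`, ignoring what follows
        state, _end = json.JSONDecoder().raw_decode(html, start)
    except json.JSONDecodeError:
        return None
    return state
```

With this, the loop can check `is_challenge_page(response.text)` before touching any fixed keys, and treat a `None` from `extract_initial_state` as "no usable data in this response" rather than crashing.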

Also, I am using a VPN to access the restaurant platform, since it is blocked in my country.

What am I missing here? Does it have something to do with the headers? I copied them from the request headers my browser sends when it opens the restaurant URL.

CodePudding user response:

Possible causes:

1. Scraping too quickly can cause the site to flag you as a bot. Add time.sleep() between requests to slow things down.

2. In my experience, a site that detects bots often checks whether the client sends back the cookies it hands out to regular visitors. Look at the cookies the site gives you and see whether sending the same cookies works. requests can keep cookies across requests for you, for example with a requests.Session.

3. Some websites also check whether the client has JavaScript enabled; if it does not, you can be flagged as a bot.

4. Finally, some websites use Cloudflare or other bot-detection services, which are very hard to bypass. Just because the screen that says "Checking your browser's IP. Powered by Cloudflare." doesn't show up when you open the site doesn't mean they are not using Cloudflare. The cfscrape and cloudscraper modules may work on some sites, but usually they do not.
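Points 1 and 2 can be combined into one small retry helper. This is only a sketch under some assumptions: the names `fetch_with_backoff` and `make_session_fetcher` are hypothetical, the delay values are arbitrary, and a page counts as "good" when it contains `window.__INITIAL_STATE__`. The blocked response in the question loads `/_Incapsula_Resource`, which suggests an Incapsula challenge page; waiting and reusing cookies may reduce how often it appears, but it is not guaranteed to get past a JavaScript challenge.

```python
import random
import time

# Marker from the question: present only in the full page
GOOD_PAGE_MARKER = "window.__INITIAL_STATE__"


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url) until the HTML contains the marker.

    Returns the HTML, or None if every attempt came back without the marker.
    """
    for attempt in range(max_attempts):
        html = fetch(url)
        if GOOD_PAGE_MARKER in html:
            return html
        if attempt < max_attempts - 1:
            # Exponential backoff with a little jitter: roughly 5 s, 10 s, 20 s, ...
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    return None


def make_session_fetcher(headers):
    """Build a fetch(url) function backed by one requests.Session.

    The session stores cookies set by the site and sends them on later requests.
    """
    import requests  # third-party; imported here so the helper above stays stdlib-only

    session = requests.Session()
    session.headers.update(headers)

    def fetch(url):
        return session.get(url, timeout=30).text

    return fetch
```

A possible way to use it: build `fetch = make_session_fetcher(headers)` once before the loop, call `html = fetch_with_backoff(fetch, resto['menu_url'])` for each restaurant, skip the restaurant when the result is `None`, and add a short `time.sleep()` between restaurants. If the challenge still appears on every attempt, the site probably requires JavaScript execution (point 3), which plain requests cannot do.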
