file_get_contents returns 403 forbidden with user agent - PHP

Time:06-16

I'm just trying to get the title from this product page, however it keeps showing a 403 forbidden error.

Warning: file_get_contents(https://www.brownsfashion.com/uk/shopping/jem-18k-yellow-gold-octogone-double-paved-ring-17648795): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in /Applications/AMPPS/www/get_prod.php on line 13

I tried adding a User-Agent header to the request, but it still doesn't work. Maybe it isn't possible.

Code below:

<?php
include('simple_html_dom.php');

$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);

echo file_get_contents("https://www.brownsfashion.com/uk/shopping/jem-18k-yellow-gold-octogone-double-paved-ring-17648795", false, $context);
?>

CodePudding user response:

This website uses three anti-bot systems:

  1. Riskified.
  2. Forter.
  3. Cloudflare.

They are used to prevent DoS/DDoS attacks and automated crawling, so a plain HTTP request is rejected even when it sends a browser-like User-Agent header.
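As a rough way to check whether a 403 comes from such a protection layer rather than from the site itself, you can look at the response headers. The sketch below uses only the Python standard library. `describe_block` and `probe` are hypothetical helper names, and the check assumes that Cloudflare-proxied responses carry a `CF-RAY` header or `Server: cloudflare`; it does not detect Riskified or Forter.

```python
import urllib.error
import urllib.request


def describe_block(status, headers):
    """Guess why a request was refused, from the status code and headers."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status != 403:
        return "not blocked" if status < 400 else f"HTTP {status}"
    # Assumption: Cloudflare-proxied responses carry at least one of these.
    if "cf-ray" in lowered or lowered.get("server", "").lower() == "cloudflare":
        return "403 from Cloudflare (likely bot protection)"
    return "403 from origin server"


def probe(url, user_agent):
    """Send one GET with the given User-Agent and classify the response."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return describe_block(response.status, dict(response.headers.items()))
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the error object still carries the headers.
        return describe_block(error.code, dict(error.headers.items()))
```

Calling `probe(url, user_agent)` with the URL and User-Agent string from the question would show which case applies. If it reports a Cloudflare 403, changing headers alone is unlikely to help.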

To get past them, you need to use a real browser (or simulate one), for example with Selenium or Playwright.
Here is an example of crawling this page with Playwright and Python:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a real (headless) WebKit browser so the page's JavaScript runs.
    browser = p.webkit.launch(headless=True)
    baseurl = "https://www.brownsfashion.com/uk/shopping/jem-18k-yellow-gold-octogone-double-paved-ring-17648795"
    page = browser.new_page()
    page.goto(baseurl)
    # Wait until each element has been rendered, then read its text.
    title = page.wait_for_selector("//a[@data-test='product-brand']")
    name = page.wait_for_selector("//span[@data-test='product-name']")
    price = page.wait_for_selector("//span[@data-test='product-price']")
    print("Title: " + title.text_content())
    print("Name: " + name.text_content())
    print("Price: " + price.text_content())
    browser.close()
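The question asked for the page title itself. Once a browser has rendered the page, Playwright's `page.title()` should return it directly. If you only have the HTML string (for example from `page.content()`), a minimal sketch with the standard library could look like this. `TitleParser` and `extract_title` are hypothetical names used for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones (e.g. inside SVG) are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

With the Playwright example above, this would be called as `extract_title(page.content())` before `browser.close()`.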

I hope I have been able to help you.
