CURL and file_get_contents() not working on certain website

Time:06-17

I am trying to scrape this website: https://bartleby.com. I wrote the code with Python requests and it works, but I need to convert it to PHP because I want the result printed on my website and my cPanel host does not run Python. So I am forced to use cURL, but it does not work. The code below returns:

Not Found
This page you were trying to reach at this address doesn't seem to exist.
What can I do now?
Sign up for your own free account.

So I am wondering how this website blocks cURL in PHP but not Requests in Python. Are there any undetectable alternatives to cURL in PHP? Thanks.

My PHP Code (Not Working):

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.bartleby.com/questions-and-answers/1.-a-given-the-lines-l-7-124-tk-13k-1-k-3-and-l-x2-3s-y-1-10s-z-3-5s-determine-the-values-of-k-if-po/b88e3e3d-bfd6-4158-8335-6a3ca420430e');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'authority' => 'www.bartleby.com',
    'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language' => 'en-US;q=0.6',
    'cache-control' => 'max-age=0',
    'sec-fetch-dest' => 'document',
    'sec-fetch-mode' => 'navigate',
    'sec-fetch-site' => 'same-origin',
    'sec-fetch-user' => '?1',
    'sec-gpc' => '1',
    'upgrade-insecure-requests' => '1',
    'user-agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36',
    'Accept-Encoding' => 'gzip',
]);
curl_setopt($ch, CURLOPT_COOKIE, 'G_ENABLED_IDPS=google; refreshToken=330bb387263aa6673c3e39e975d729f723b38002; userId=4c28bc2c-1eec-4d2c-b44d-7bfa78216ba3; userStatus=A1; promotionId=; sku=bb999_bookstore; endCycleWhenQuestionsRemainingWasClosed=2022-06-19T07:00:00.000Z; btbHomeDashboardTooltipAnimationCount=0; isNoQuestionAskedModalClosed=true; accessToken=34ceed9609a07bd0238a74b5650d5c5362990498; bartlebyRefreshTokenExpiresAt=2022-07-16T12:37:57.217Z; btbHomeDashboardAnimationTriggerDate=2022-06-17T12:39:25.907Z; OptanonConsent=isGpcEnabled=1&datestamp=Thu Jun 16 2022 20:39:43 GMT+0800 (China Standard Time)&version=6.32.0&isIABGlobal=false&hosts=&consentId=9432e357-0639-4883-9f99-39bed0bb5cd9&interactionCount=0&landingPath=NotLandingPage&groups=C0001:1,C0003:1,BG142:0,C0002:0,C0005:0,C0004:0&AwaitingReconsent=false');

$response = curl_exec($ch);
echo $response;
curl_close($ch);

I also tried to use file_get_contents() but it returns an error: Warning: file_get_contents(https://bartleby.com): Failed to open stream: HTTP request failed! HTTP/1.1 503 Service Temporarily Unavailable in D:\xampp\htdocs\bartleby\index.php on line 11

Line 11 is $response = file_get_contents($url, false, stream_context_create($arrContextOptions));

Full code (Not Working):

<?php
$url= 'https://bartleby.com';

$arrContextOptions=array(
      "ssl"=>array(
            "verify_peer"=>false,
            "verify_peer_name"=>false,
        ),
    );  

$response = file_get_contents($url, false, stream_context_create($arrContextOptions));
echo $response;

My Python Code (Working):

import requests

cookies = {
    'G_ENABLED_IDPS': 'google',
    'refreshToken': '330bb387263aa6673c3e39e975d729f723b38002',
    'userId': '4c28bc2c-1eec-4d2c-b44d-7bfa78216ba3',
    'userStatus': 'A1',
    'promotionId': '',
    'sku': 'bb999_bookstore',
    'endCycleWhenQuestionsRemainingWasClosed': '2022-06-19T07:00:00.000Z',
    'btbHomeDashboardTooltipAnimationCount': '0',
    'isNoQuestionAskedModalClosed': 'true',
    'accessToken': '34ceed9609a07bd0238a74b5650d5c5362990498',
    'bartlebyRefreshTokenExpiresAt': '2022-07-16T12:37:57.217Z',
    'btbHomeDashboardAnimationTriggerDate': '2022-06-17T12:39:25.907Z',
    'OptanonConsent': 'isGpcEnabled=1&datestamp=Thu Jun 16 2022 20:39:43 GMT+0800 (China Standard Time)&version=6.32.0&isIABGlobal=false&hosts=&consentId=9432e357-0639-4883-9f99-39bed0bb5cd9&interactionCount=0&landingPath=NotLandingPage&groups=C0001:1,C0003:1,BG142:0,C0002:0,C0005:0,C0004:0&AwaitingReconsent=false',
}

headers = {
    'authority': 'www.bartleby.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'en-US;q=0.6',
    'cache-control': 'max-age=0',
    # Requests sorts cookies= alphabetically
    # 'cookie': 'G_ENABLED_IDPS=google; refreshToken=330bb387263aa6673c3e39e975d729f723b38002; userId=4c28bc2c-1eec-4d2c-b44d-7bfa78216ba3; userStatus=A1; promotionId=; sku=bb999_bookstore; endCycleWhenQuestionsRemainingWasClosed=2022-06-19T07:00:00.000Z; btbHomeDashboardTooltipAnimationCount=0; isNoQuestionAskedModalClosed=true; accessToken=34ceed9609a07bd0238a74b5650d5c5362990498; bartlebyRefreshTokenExpiresAt=2022-07-16T12:37:57.217Z; btbHomeDashboardAnimationTriggerDate=2022-06-17T12:39:25.907Z; OptanonConsent=isGpcEnabled=1&datestamp=Thu Jun 16 2022 20:39:43 GMT+0800 (China Standard Time)&version=6.32.0&isIABGlobal=false&hosts=&consentId=9432e357-0639-4883-9f99-39bed0bb5cd9&interactionCount=0&landingPath=NotLandingPage&groups=C0001:1,C0003:1,BG142:0,C0002:0,C0005:0,C0004:0&AwaitingReconsent=false',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'sec-gpc': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36',
}

response = requests.get('https://www.bartleby.com/questions-and-answers/1.-a-given-the-lines-l-7-124-tk-13k-1-k-3-and-l-x2-3s-y-1-10s-z-3-5s-determine-the-values-of-k-if-po/b88e3e3d-bfd6-4158-8335-6a3ca420430e', cookies=cookies, headers=headers)
print(response.text)

Answer:

Your request does not send a User-Agent header.

It looks like that website requires a User-Agent from a real browser, such as Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0.

Here is code that works for me.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.bartleby.com/questions-and-answers/1.-a-given-the-lines-l-7-124-tk-13k-1-k-3-and-l-x2-3s-y-1-10s-z-3-5s-determine-the-values-of-k-if-po/b88e3e3d-bfd6-4158-8335-6a3ca420430e');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']); // This is required.
// This example reuses the user agent of the visiting web browser.
// You may replace it with any other browser user agent string.
curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'headerFunction'); // Only for debugging response headers.

$response = curl_exec($ch);

if (curl_errno($ch)) {
    echo 'cURL error: ' . curl_error($ch);
    echo '<br>';
    exit();
}

echo '<hr>' . PHP_EOL;
echo '<h4>cURL response body</h4>' . PHP_EOL;
echo $response;
curl_close($ch);

unset($ch, $response);


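/**
 * Header callback used for debugging: prints each response header line.
 * cURL requires the callback to return the number of bytes it handled.
 */
function headerFunction($ch, $header)
{
    echo $header;
    echo '<br>';
    return strlen($header); // byte count, not character count
}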

Your code sets the request headers using the wrong array format.

curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'authority' => 'www.bartleby.com',
    //...
]);

This is wrong. cURL expects a list of "Name: value" strings, so it should be:

curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'authority: www.bartleby.com',
    //...
]);
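
If the headers are easier to maintain as a name => value map, a small helper can convert them into the "Name: value" strings shown above. This is only a sketch; formatHeaders() is a hypothetical helper name, not part of cURL.

```php
<?php
/**
 * Convert ['name' => 'value'] pairs into a list of "name: value" strings,
 * which is the format CURLOPT_HTTPHEADER expects.
 * formatHeaders() is a hypothetical helper, not a cURL function.
 */
function formatHeaders(array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

$headers = formatHeaders([
    'accept-language' => 'en-US;q=0.6',
    'cache-control'   => 'max-age=0',
]);
// $headers is now ['accept-language: en-US;q=0.6', 'cache-control: max-age=0']
```

The result can then be passed as curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);.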

To debug the request headers, enable curl_setopt($ch, CURLINFO_HEADER_OUT, true); before curl_exec(), then read them with $reqHeaders = curl_getinfo($ch, CURLINFO_HEADER_OUT);.
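
That debugging step could look roughly like the sketch below. The function name dumpRequestHeaders() is an assumption made for illustration, and the function is only defined here, not called, because calling it needs network access.

```php
<?php
/**
 * Send a GET request and return the request headers cURL actually sent.
 * dumpRequestHeaders() is a hypothetical helper for debugging only.
 */
function dumpRequestHeaders(string $url, string $userAgent): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    // Must be enabled before curl_exec(), otherwise nothing is recorded.
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);

    if (curl_exec($ch) === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    // The raw request line and headers as one string.
    return (string) curl_getinfo($ch, CURLINFO_HEADER_OUT);
}
```

Calling echo dumpRequestHeaders($url, $userAgent); with your own URL and user agent string should print a User-Agent: line; if that line is missing, the header was not sent.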

Because the headers are in the wrong format, your current code never sends a User-Agent header, which is why it does not work.
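
The file_get_contents() attempt in the question sends no User-Agent either, since PHP's user_agent setting is empty by default. Whether that alone explains the 503 is an assumption and was not tested here, but the header can be supplied through the http section of the stream context. buildContextOptions() is a hypothetical helper name used only for this sketch.

```php
<?php
/**
 * Build stream context options that send a User-Agent header.
 * buildContextOptions() is a hypothetical helper for illustration.
 */
function buildContextOptions(string $userAgent): array
{
    return [
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            // Extra headers go here as "Name: value" strings.
            'header'     => ['Accept-Language: en-US;q=0.6'],
        ],
    ];
}

$options = buildContextOptions(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'
);
$context = stream_context_create($options);

// Assumed usage; needs network access:
// $response = file_get_contents('https://www.bartleby.com/', false, $context);
```

The http options apply to https:// URLs as well, so the same context works for both schemes.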
