Thanks for looking into this. Though it may be a simple problem, I am too new at scraping pages to understand why this simple code returns false. Most examples I see online use the base URL, but I am trying to scrape a specific product page. Using 'http://www.google.com/' works fine. Could it be that I am being blocked? If so, how would one get around it in PHP? In Python one would rotate User-Agents and proxies. Any nuggets of knowledge will be appreciated. Here is the basic code with the specific link.
require_once($_SERVER['DOCUMENT_ROOT'].'/includes/simple_html_dom.php');
$url = 'https://www.lowes.com/pd/Frigidaire-Gallery-22-cu-ft-Counter-depth-Side-by-Side-Refrigerator-with-Ice-Maker-Fingerprint-Resistant-Black-Stainless-Steel/1000368269';
$html = file_get_html($url);
Thanks guys.
CodePudding user response:
Lowe's appears to use some anti-scraping technology, so you cannot rely on file_get_html. However, you can fetch the page with PHP's cURL functions and then parse the result with str_get_html from Simple HTML DOM.
<?php
require_once($_SERVER['DOCUMENT_ROOT'].'/includes/simple_html_dom.php');
$url = 'https://www.lowes.com/pd/Frigidaire-Gallery-22-cu-ft-Counter-depth-Side-by-Side-Refrigerator-with-Ice-Maker-Fingerprint-Resistant-Black-Stainless-Steel/1000368269';
// From https://gist.github.com/fijimunkii/952acac988f2d25bef7e0284bc63c406
$user_agents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0"
];
// Pick a random user agent for this request
$user_agent = $user_agents[array_rand($user_agents)];
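$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// Return the response body as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
$exec = curl_exec($ch);
if ($exec === false) {
    // curl_exec returns false on failure, so report why before parsing
    die('cURL error: ' . curl_error($ch));
}
// Parse the fetched markup with Simple HTML DOM
$html = str_get_html($exec);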
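Since you also asked about proxies: the same request can be wrapped in two small helpers so the user agent and an optional proxy are chosen in one place. This is only a rough sketch. The function names build_curl_options and fetch_page, the 30-second timeout, and the proxy address in the comment are my own assumptions, not anything Lowe's or Simple HTML DOM requires, and I cannot promise it gets past their protection.

```php
<?php
// Hypothetical helper: builds the option array for curl_setopt_array().
// $proxy is optional; pass something like 'http://127.0.0.1:8080' (placeholder).
function build_curl_options(string $url, array $user_agents, ?string $proxy = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 30,   // assumed limit, in seconds
        CURLOPT_USERAGENT      => $user_agents[array_rand($user_agents)],
    ];
    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy;
    }
    return $options;
}

// Hypothetical helper: returns the response body, or null when the
// transfer fails or the server answers with anything other than 200.
function fetch_page(array $options): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($body === false || $status !== 200) {
        return null;
    }
    return $body;
}
```

You would then call `$body = fetch_page(build_curl_options($url, $user_agents));` and only pass `$body` to str_get_html when it is not null. Checking the status code first makes it easier to tell a block (for example a 403) apart from a parsing problem.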