I am scraping data from Google News with Goutte. In development the code below returns the expected results, but in production the exact same code does not work: it shows a white page with no errors. Here is the source code:
<?php
require __DIR__."/../../../vendor/autoload.php";
use Goutte\Client;
// Build the Google News search URL; the query is URL-encoded so spaces and '*' are sent intact.
function unifyUrl($q)
{
    return 'https://news.google.com/search?q=' . urlencode($q) . '&hl=fr&gl=FR&ceid=FR:fr&dpr=2';
}
$client = new Client();
$url = unifyUrl('* site:*.cd');
// Request the URL built above instead of a second hard-coded copy of it.
$crawler = $client->request('GET', $url);
$crawler->filter('#yDmH0d > c-wiz.zQTmif.SSPGKf > div > div.FVeGwb.CVnAc.Haq2Hf.bWfURe > div.ajwQHc.BL5WZb.RELBvb > div.tsldL.Oc0wGc.RELBvb > main > c-wiz > div.lBwEZb.BL5WZb.GndZbb > div.NiLAwe.y6IFtc.R7GTQ.keNKEd.j7vNaf.nID9nc')->each(function ($node)
{
    //$title = $node->filter('.field-content > a')->text();
    echo $node->text(); // prints nothing in production
    $link = 'https://news.google.com' . $node->filter('a')->attr('href');
    $img = $node->filter('a > figure > img')->attr('src');
    $title = $node->filter('div > article > h3')->text();
    $source = $node->filter('div > article > div > div > a')->text();
    $date = $node->filter('div > article > div > div > time')->text();
    // echo $title; also prints nothing in production.
});
?>
Any help would be appreciated.
CodePudding user response:
Where is production hosted? On DigitalOcean or something like that? Google looks at the IP address of the server making the request and uses it to guess whether or not you are a bot.
If it sees an IP address that belongs to a hosting provider such as DigitalOcean, it is likely to block the request.
Where is development hosted? On your local computer? Then Google sees a request coming from a local ISP and assumes it is not a bot.
You can try a proxy service, probably a paid one so that the IP address is unique to you, and send the requests from the production server through it.
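To check whether blocking is really what happens in production, it can help to look at the HTTP status and body that come back before running any CSS selectors. The following is only a rough sketch: `looksBlocked` and `makeProxiedClient` are hypothetical helper names, the status codes and the "unusual traffic" marker are assumptions about how a block page may look, and the proxy URL is a placeholder. It also assumes Goutte 4, where `Client` accepts a Symfony HttpClient instance.

```php
<?php
use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

/**
 * Heuristic only: guess whether a response is a block page rather than results.
 * The status codes and the marker phrase are assumptions, not a documented contract.
 */
function looksBlocked(int $status, string $body): bool
{
    // 403, 429 and 503 are statuses commonly returned to rejected automated traffic.
    if (in_array($status, [403, 429, 503], true)) {
        return true;
    }
    // An empty body or an "unusual traffic" notice also suggests a block.
    return trim($body) === '' || stripos($body, 'unusual traffic') !== false;
}

/**
 * Build a Goutte client whose requests go through a proxy.
 * Assumes Goutte 4 (Client built on Symfony BrowserKit and HttpClient).
 */
function makeProxiedClient(string $proxyUrl): Client
{
    return new Client(HttpClient::create(['proxy' => $proxyUrl]));
}

// Usage on the production server (the proxy URL is a placeholder):
// require __DIR__ . '/vendor/autoload.php';
// $client = makeProxiedClient('http://user:pass@proxy.example.com:8080');
// $crawler = $client->request('GET', $url);
// $response = $client->getInternalResponse();
// if (looksBlocked($response->getStatusCode(), $response->getContent())) {
//     error_log('Request appears blocked, status ' . $response->getStatusCode());
// }
```

If the check reports a block without the proxy and passes with it, that supports the IP-based explanation. If it passes in both cases, the selectors are the more likely culprit, since the page markup served to the production server may differ.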