I'm trying to scrape information directly from the Maersk website. For example, I'm trying to scrape the information from this URL: https://www.maersk.com/tracking/221242675. I have a lot of tracking numbers to update in a database every day, so I decided to automate it a bit.
However, with the following code, the response says that the page needs JavaScript to work. I have also tried curl and other approaches, but nothing works. Does anyone know another way?
I tried the following code:
<?php
// ------------ test 14 ------------
$html = file_get_contents('https://www.maersk.com/tracking/#tracking/221242675'); // get the HTML returned from the URL
echo $html;
$ETAupdate = new DOMDocument();
libxml_use_internal_errors(true); // suppress libxml errors
if (!empty($html)) { // if any HTML is actually returned
    $ETAupdate->loadHTML($html);
    libxml_clear_errors(); // discard errors caused by malformed HTML
    $ETA_xpath = new DOMXPath($ETAupdate);
    // get all the <strong> elements
    $ETA_row = $ETA_xpath->query('//strong');
    if ($ETA_row->length > 0) {
        foreach ($ETA_row as $row) {
            echo $row->nodeValue . "<br/>";
        }
    }
}
?>
Answer:
The tracking page is rendered by JavaScript, so the HTML you download doesn't contain the data. You need to fetch the data from the API requests the page makes, rather than scraping the page URL itself (unless you're using something like Puppeteer, but I really don't recommend that for such a simple task).
I took a look at the site and the API endpoint is:
https://api.maersk.com/track/221242675?operator=MAEU
This returns a JSON response that you can parse to extract the details, which is much easier than parsing the HTML. Example response below:
{
"tpdoc_num": "221242675",
"isContainerSearch": false,
"origin": {
"terminal": "YanTian Intl. Container Terminal",
"geo_site": "1PVA2R05ZGGHQ",
"city": "Yantian",
"state": "Guangdong",
"country": "China",
"country_code": "CN",
"geoid_city": "0L3DBFFJ3KZ9A",
"site_type": "TERMINAL"
},
"destination": {
"terminal": "DCT Gdansk sa",
"geo_site": "02RB4MMG6P32M",
"city": "Gdansk",
"state": "",
"country": "Poland",
"country_code": "PL",
"geoid_city": "3RIGHAIZMGKN3",
"site_type": "TERMINAL"
},
"containers": [ ... ]
}
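To tie this back to the PHP in the question, here is a minimal sketch of fetching that endpoint and reading a few fields from the JSON. The function names (`maerskTrackingUrl`, `summarizeTracking`, `fetchTracking`) and the choice of fields are just illustrative, not anything Maersk provides. It also assumes the endpoint answers a plain GET request; it may require extra headers or authentication, or change without notice, so treat this as a starting point.

```php
<?php
// Hypothetical helper: builds the endpoint URL quoted above.
function maerskTrackingUrl(string $trackingNumber, string $operator = 'MAEU'): string
{
    return 'https://api.maersk.com/track/' . rawurlencode($trackingNumber)
        . '?' . http_build_query(['operator' => $operator]);
}

// Hypothetical helper: decodes the JSON body and keeps a few fields
// that appear in the example response. Missing keys become null.
function summarizeTracking(string $json): array
{
    // JSON_THROW_ON_ERROR (PHP 7.3+) raises JsonException on invalid JSON.
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    return [
        'tracking_number'  => $data['tpdoc_num'] ?? null,
        'origin_city'      => $data['origin']['city'] ?? null,
        'origin_terminal'  => $data['origin']['terminal'] ?? null,
        'destination_city' => $data['destination']['city'] ?? null,
        'destination_terminal' => $data['destination']['terminal'] ?? null,
    ];
}

// Hypothetical helper: performs the HTTP request and returns the summary,
// or null if the request fails (network error or non-2xx status).
function fetchTracking(string $trackingNumber): ?array
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "Accept: application/json\r\n",
            'timeout' => 15, // seconds
        ],
    ]);

    $body = @file_get_contents(maerskTrackingUrl($trackingNumber), false, $context);
    if ($body === false) {
        return null;
    }

    return summarizeTracking($body);
}
```

For the daily update, you could loop over your tracking numbers, call `fetchTracking()` for each one, and write the returned array to your database, ideally with a short pause between requests so you don't hammer the API.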