Scraping websites with PHP

Time:11-18

I'm trying to scrape information directly from the Maersk website. For example, I'm trying to scrape the information from this URL: https://www.maersk.com/tracking/221242675. I have a lot of tracking numbers to update in a database every day, so I decided to automate it a little.

However, with the following code, the site responds that it needs JS to work. I have already tried cURL and other approaches, but nothing works. Does anyone know another way?

I tried the following code:


<?php
// ------------ test 14 ------------
$html = file_get_contents('https://www.maersk.com/tracking/#tracking/221242675'); // fetch the HTML returned by this URL
echo $html;
$ETAupdate = new DOMDocument();

libxml_use_internal_errors(TRUE); //disable libxml errors

if(!empty($html)){ //if any html is actually returned

    $ETAupdate->loadHTML($html);
    libxml_clear_errors(); //remove errors for yucky html
    
    $ETA_xpath = new DOMXPath($ETAupdate);

    // get all the <strong> elements in the document
    $ETA_row = $ETA_xpath->query('//strong');

    if($ETA_row->length > 0){
        foreach($ETA_row as $row){
            echo $row->nodeValue . "<br/>";
        }
    }
}
?>

Answer:

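You need to fetch the data from the API requests the page makes, rather than scraping the page URL itself. The tracking details are loaded by JavaScript after the page loads, which is why the raw HTML only tells you that JS is required. (A headless browser such as Puppeteer would also work, but I really don't recommend that for a task this simple.)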

I took a look at the site and the API endpoint is:

https://api.maersk.com/track/221242675?operator=MAEU

This returns a JSON response that you can parse to extract the details, which is much easier than parsing the HTML. Example response:

{
    "tpdoc_num": "221242675",
    "isContainerSearch": false,
    "origin": {
        "terminal": "YanTian Intl. Container Terminal",
        "geo_site": "1PVA2R05ZGGHQ",
        "city": "Yantian",
        "state": "Guangdong",
        "country": "China",
        "country_code": "CN",
        "geoid_city": "0L3DBFFJ3KZ9A",
        "site_type": "TERMINAL"
    },
    "destination": {
        "terminal": "DCT Gdansk sa",
        "geo_site": "02RB4MMG6P32M",
        "city": "Gdansk",
        "state": "",
        "country": "Poland",
        "country_code": "PL",
        "geoid_city": "3RIGHAIZMGKN3",
        "site_type": "TERMINAL"
    },
    "containers": [ ... ]
}
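
To tie this back to the original PHP attempt, here is a minimal sketch of fetching and parsing that response. The function names (trackingUrl, fetchTrackingJson, summarizeTracking) are made up for illustration. It assumes the endpoint answers a plain GET without extra headers, cookies, or API keys, which may not hold, and it reads only fields that appear in the sample above, since the contents of "containers" are not shown.

```php
<?php
// Build the endpoint URL quoted above. "operator=MAEU" is copied from that
// URL as-is; what other operator values mean is not covered here.
function trackingUrl(string $trackingNumber): string
{
    return 'https://api.maersk.com/track/' . rawurlencode($trackingNumber) . '?operator=MAEU';
}

// Fetch the raw JSON body, or null on any failure.
// Assumption: a plain GET is enough; the real API may demand more.
function fetchTrackingJson(string $trackingNumber): ?string
{
    $ch = curl_init(trackingUrl($trackingNumber));
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_TIMEOUT        => 15,     // seconds
        CURLOPT_HTTPHEADER     => ['Accept: application/json'],
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);

    return ($body !== false && $status === 200) ? $body : null;
}

// Decode the response and pick out a few fields from the sample above.
// Returns null when the body is not valid JSON or lacks those fields.
function summarizeTracking(string $json): ?array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['origin']['city'], $data['destination']['city'])) {
        return null;
    }

    return [
        'tracking_number' => $data['tpdoc_num'] ?? null,
        'origin'          => $data['origin']['city'],
        'destination'     => $data['destination']['city'],
    ];
}
```

For a daily update you could loop over your tracking numbers, call fetchTrackingJson() for each, pass any non-null result to summarizeTracking(), and write the returned array to your database. Adding a short pause between requests is a sensible precaution.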