Scrape data from HTML page using DOMDocument

Time:11-11

I am new to PHP and am trying to write a script that gets data from an external site. I am interested in getting the value of Merk, which is Opel. The HTML for it looks like this:

<div class="row">
    <div class="col-6 col-sm-5 label" data-tooltip="<strong>Merk</strong><br/>Het merk van het voertuig. Dit wordt voor alle voertuigsoorten geregistreerd.
<span>bron: RDW</span> ">
        Merk
    <span data-toggle="tooltip" data-html="true" title="<strong>Merk</strong><br/>Het merk van het voertuig. Dit wordt voor alle voertuigsoorten geregistreerd.<br /><span>bron: RDW</span> "></span><span data-toggle="tooltip" data-html="true" title="<strong>Merk</strong><br/>Het merk van het voertuig. Dit wordt voor alle voertuigsoorten geregistreerd.<br /><span>bron: RDW</span> "></span></div>
    <div class="col-6 col-sm-7 value">
Opel
    </div>
</div>

I am trying to get it with PHP code like below

<?php
// a new dom object
$dom = new domDocument; 

// load the html into the object
$dom->loadHTML('https://centraalbeheerkentekencheck.azurewebsites.net/?kenteken=L-762-LZ'); 

// discard white space
$dom->preserveWhiteSpace = false;

$rowData= $dom->getElementsByTagName('row');

But now I am stuck and do not know how to finish the remaining code so that I can get the value of Merk, which is Opel. Let me know if anyone here can help me achieve my goal.

Thanks!

CodePudding user response:

I think it is better to use SimpleHtmlDom for this (like voku/simple_html_dom):

composer require voku/simple_html_dom

The SimpleHtmlDom version

You used the URL https://centraalbeheerkentekencheck.azurewebsites.net/?kenteken=L-762-LZ for this, but that page only contains an iframe pointing to https://centraalbeheer.finnik.nl/kenteken/l762lz/gratis, so I use the iframe URL instead in the script:

use voku\helper\HtmlDomParser;
require_once __DIR__ . "/vendor/autoload.php";

function getBrand(string $license) : string
{
    $license = strtolower(str_replace("-", "", $license));
    $dom = HtmlDomParser::file_get_html("https://centraalbeheer.finnik.nl/kenteken/".$license."/gratis");
    $brand = $dom->find(".result .row .value")[0]->innerHtml();
    return str_replace(["&#13;", "\n", "\r"], "", $brand);
}

var_dump(getBrand("L-762-LZ"));
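
The iframe mentioned above can also be located from code rather than by inspecting the page by hand. The following is a minimal sketch, assuming the iframe is present in the static HTML (if the page injects it with JavaScript, it will not be found this way). The wrapper markup in $html is invented for illustration, and findIframeSources() is a hypothetical helper name, not part of any library.

```php
<?php
// Invented wrapper page for illustration; the real page's markup may differ.
$html = '<html><body>'
    . '<iframe src="https://centraalbeheer.finnik.nl/kenteken/l762lz/gratis"></iframe>'
    . '</body></html>';

/**
 * Hypothetical helper: collect the src attribute of every iframe in the markup.
 */
function findIframeSources(string $html): array
{
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html); // loadHTML() expects markup, not a URL
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($dom->getElementsByTagName('iframe') as $iframe) {
        $src = $iframe->getAttribute('src');
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

print_r(findIframeSources($html));
```

To run this against the live page, the markup would first have to be fetched (for example with file_get_contents) and passed in as a string.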

Update: You can also do this with regex

function getBrandRegex(string $license) : string
{
    $license = strtolower(str_replace("-", "", $license));
    $content = file_get_contents("https://centraalbeheer.finnik.nl/kenteken/".$license."/gratis");
    preg_match_all('/<div class="col-6 col-sm-7 value">(.*?)<\/div>/s', $content, $matches);
    $brand = $matches[1][0];
    return trim(str_replace(["&#13;", "\n", "\r"], "", $brand));
}

var_dump(getBrandRegex("L-762-LZ"));

Update: The DomDocument version

function getBrandDomDocument(string $license) : string
{
    libxml_use_internal_errors(true); //see: https://www.php.net/manual/en/function.libxml-use-internal-errors.php
    $license = strtolower(str_replace("-", "", $license));
    $dom = new \DomDocument;
    $dom->loadHTMLFile("https://centraalbeheer.finnik.nl/kenteken/".$license."/gratis");
    $dom->preserveWhiteSpace = false;

    $xpath = new \DOMXPath($dom);
    $data = $xpath->query("//div[contains(@class, 'col-6 col-sm-7 value')]");

    return trim(str_replace(["&#13;", "\n", "\r"], "", $data[0]->textContent));
}

var_dump(getBrandDomDocument("L-762-LZ"));

Output

Opel
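
All three versions above return the first value on the page, which happens to be Merk. As a variation, the value can be looked up by its label, so the result does not depend on row order. The sketch below is not taken from the answer above: it runs on a simplified copy of the markup from the question (tooltip attributes omitted, and a second row invented for illustration), and getValueByLabel() is a hypothetical helper name. It also shows the two things that went wrong in the question's attempt: loadHTML() takes an HTML string rather than a URL, and getElementsByTagName() matches tag names such as div, not class names such as row.

```php
<?php
// Simplified markup based on the question; the second row is invented for illustration.
$html = <<<'HTML'
<div class="result">
  <div class="row">
    <div class="col-6 col-sm-5 label">
        Merk
    <span data-toggle="tooltip"></span></div>
    <div class="col-6 col-sm-7 value">
Opel
    </div>
  </div>
  <div class="row">
    <div class="col-6 col-sm-5 label">Model</div>
    <div class="col-6 col-sm-7 value">ExampleModel</div>
  </div>
</div>
HTML;

/**
 * Hypothetical helper: return the trimmed text of the "value" div that follows
 * the "label" div with the given text, or null when no such row exists.
 * The label is placed inside a single-quoted XPath literal, so it must not
 * contain an apostrophe.
 */
function getValueByLabel(string $html, string $label): ?string
{
    $previous = libxml_use_internal_errors(true); // silence warnings about imperfect HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // concat(' ', normalize-space(@class), ' ') lets us match one class token
    // regardless of the order or number of other classes on the element.
    $query = sprintf(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' label ')]"
        . "[normalize-space(text()) = '%s']"
        . "/following-sibling::div[contains(concat(' ', normalize-space(@class), ' '), ' value ')][1]",
        $label
    );

    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

var_dump(getValueByLabel($html, 'Merk')); // string(4) "Opel"
```

Matching on the label text costs a slightly longer XPath expression, but the lookup keeps working if the site reorders its rows or adds classes to the divs.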