XPath - How to extract the element following the parent H1 tag

Time:05-29

I am trying to extract blog posts from web pages. Different pages have different structures, so it is very difficult to extract what I need. There is also some CSS and JS code in the HTML, which I have to avoid.

  • I know the <h1> a dummy title </h1> beforehand, so it can help to identify the right element.
  • I do not know any ID or CLASS attributes.
<body>
    <div>
        <h1> a dummy title </h1>

            <script> function loadDoc() {const xhttp = new XMLHttpRequest();} </script>

        <div >
            <p>...</p>
        </div>

        <div >
            <p>...</p>
                <div >...</div>
            <p>...</p>
                <div >...</div>
            <p>...</p>
            <p>...</p>
       </div>

       <div >
            <p>...</p>
            <p>...</p>
       </div>

       <div >
            <p>...</p>
            <p>...</p>
                <div >...</div>
            <p>...</p>
            <p>...</p>
            <p>...</p>
       </div>
  </div>
</body>

What I have tried with:
I have tried to find the <div> with the most <p> elements, but sometimes other <div>s on the page contain more <p> elements. I have to avoid those by picking the one nearest to the <h1>.

$html= 
'[My html above]

';
$HTMLDoc = new DOMDocument();
$HTMLDoc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );   
$xpath = new DOMXPath($HTMLDoc);

# locate the divs that contain p elements
$pees = $xpath->query('//div[.//p]');
$pchilds = [];

# get the number of child elements in each div
# (childElementCount counts every child element, not only p)
foreach ($pees as $pee) {
    $pchilds[] = $pee->childElementCount;
}

# now find the div with the max number of children
foreach ($pees as $pee) {
    if ($pee->childElementCount == max($pchilds)) {
        echo $pee->nodeValue;
        # or do whatever
    }
}

CodePudding user response:

Find all divs that follow the h1 as siblings, count the p elements inside each, and finally take the first one with the max() count:

$document = new DOMDocument();
// The question's markup is HTML, so parse it with loadHTML() rather than loadXML()
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$dcnt = array();
// Find all divs that are following siblings of an h1
$divs = $xpath->query('//h1/following-sibling::div');

// Count the `p` descendants of each div
foreach ($divs as $idx => $d) {
    $dcnt[$idx] = (int) $xpath->evaluate('count(.//p)', $d);
}

// Show the content of the first div with the max() count
// (max() fails on an empty array, so guard against "no divs found")
$max = $dcnt ? max($dcnt) : 0;
foreach ($divs as $idx => $d) {
    if ($dcnt[$idx] == $max) {
        print $idx . ': ' . $d->nodeName . ': ' . $d->nodeValue;
        break;
    }
}
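Since the question says the title text is known in advance, the `//h1` step can be narrowed to that one heading, so divs elsewhere on the page are never considered. The following is only a minimal sketch: the helper name `findPostDiv`, the sample markup, and its `id` attributes are hypothetical and exist purely for illustration, and it assumes the known title matches the heading text once whitespace is normalised.

```php
<?php
// Hypothetical helper: returns the sibling div after the known <h1>
// that contains the most <p> descendants, or null if nothing matches.
function findPostDiv(string $html, string $title): ?DOMElement
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real pages are rarely valid HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//h1') as $h1) {
        // Compare in PHP so the title never has to be quoted inside XPath
        if ($xpath->evaluate('normalize-space(.)', $h1) !== $title) {
            continue;
        }
        $best = null;
        $bestCount = 0;
        foreach ($xpath->query('following-sibling::div', $h1) as $div) {
            $count = (int) $xpath->evaluate('count(.//p)', $div);
            if ($count > $bestCount) {  // strict ">" keeps the first div on ties
                $best = $div;
                $bestCount = $count;
            }
        }
        return $best;
    }
    return null;
}

// Made-up sample: "sidebar" has the most <p> but is not a sibling of the <h1>
$html = '<body><div>
  <h1> a dummy title </h1>
  <script>function loadDoc() {}</script>
  <div id="intro"><p>...</p></div>
  <div id="post"><p>...</p><div>...</div><p>...</p><p>...</p></div>
  <div id="footer"><p>...</p><p>...</p></div>
</div>
<div id="sidebar"><p>...</p><p>...</p><p>...</p><p>...</p><p>...</p></div>
</body>';

$div = findPostDiv($html, 'a dummy title');
echo $div ? $div->getAttribute('id') : 'no match';   // prints "post"
```

Tracking the best div in a single pass also avoids calling max() on every loop iteration.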

CodePudding user response:

XPath 2.0 solution (PHP's built-in DOMXPath only implements XPath 1.0; see SaxonC for PHP support). Find the first div after the <h1> that contains the maximum number of <p> children:

//h1/following-sibling::div[p[max(//h1/following-sibling::div/count(p))]][1]

Output :

<div >
            <p>...</p>
                <div >...</div>
            <p>...</p>
                <div >...</div>
            <p>...</p>
            <p>...</p>
       </div>

XPath 1.0 approximate solution (it only compares each div with its nearest neighbouring divs, so it could return the wrong div):

//h1/following-sibling::div[count(./p)>1][count(./p)>count(./preceding-sibling::div[./p][1]/p)][count(./p)>count(./following-sibling::div[./p][1]/p)][1]
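
To see why the XPath 1.0 expression is only approximate, it can be run with PHP's DOMXPath against markup where the largest div is not the first local maximum. This is a small sketch: the markup and its `id` attributes are made up for the demonstration.

```php
<?php
// Hypothetical markup: the sibling divs hold 1, 3, 2 and 5 <p> children.
$html = '<body><div>
  <h1> a dummy title </h1>
  <div id="a"><p>...</p></div>
  <div id="b"><p>...</p><p>...</p><p>...</p></div>
  <div id="c"><p>...</p><p>...</p></div>
  <div id="d"><p>...</p><p>...</p><p>...</p><p>...</p><p>...</p></div>
</div></body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The XPath 1.0 expression from the answer, split only for readability
$expr = '//h1/following-sibling::div[count(./p)>1]'
      . '[count(./p)>count(./preceding-sibling::div[./p][1]/p)]'
      . '[count(./p)>count(./following-sibling::div[./p][1]/p)][1]';

$approx = $xpath->query($expr)->item(0);

// Exact answer for comparison: count in PHP and keep the first maximum
$exact = null;
$max = 0;
foreach ($xpath->query('//h1/following-sibling::div') as $div) {
    $n = (int) $xpath->evaluate('count(p)', $div);
    if ($n > $max) {
        $exact = $div;
        $max = $n;
    }
}

echo $approx->getAttribute('id'), ' vs ', $exact->getAttribute('id');   // prints "b vs d"
```

Here `b` has more paragraphs than both of its neighbours, so the expression stops there even though `d` holds more.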