I am trying to extract blog posts from web pages. Different pages have different structures, so it is difficult to extract what I need. There is also some CSS and JS code inside the HTML, which I have to skip.
- I know the <h1> a dummy title </h1> from a previous step, so it can help to validate the right block.
- I do not know any ID or CLASS attributes.
<body>
<div>
<h1> a dummy title </h1>
<script> function loadDoc() {const xhttp = new XMLHttpRequest();} </script>
<div >
<p>...</p>
</div>
<div >
<p>...</p>
<div >...</div>
<p>...</p>
<div >...</div>
<p>...</p>
<p>...</p>
</div>
<div >
<p>...</p>
<p>...</p>
</div>
<div >
<p>...</p>
<p>...</p>
<div >...</div>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
</div>
</body>
What I have tried:
I tried to find the <div> with the most <p> elements, but sometimes another <div> elsewhere on the page has the most <p> elements. I have to exclude those by picking the one nearest to the <h1>.
$html = '[My html above]';
$HTMLDoc = new DOMDocument();
$HTMLDoc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($HTMLDoc);
# locate every div that contains at least one p
$pees = $xpath->query('//div[.//p]');
$pchilds = [];
# get the number of child elements in each div
# (childElementCount counts every child element, not only p)
foreach ($pees as $pee) {
    $pchilds[] = $pee->childElementCount;
}
# now find the div(s) with the max number of child elements
$max = $pchilds ? max($pchilds) : 0;
foreach ($pees as $pee) {
    if ($pee->childElementCount == $max) {
        echo $pee->nodeValue;
        # or do whatever
    }
}
CodePudding user response:
Find all divs that follow the h1, count the p elements inside each, and finally take the first one with the max() count:
$document = new DOMDocument();
// $xml holds the markup from the question; loadXML() needs it to be well-formed
$document->loadXML($xml);
$xpath = new DOMXPath($document);
$dcnt = array();
// Find all divs following an H1
$divs = $xpath->query('//h1/following-sibling::div');
// Count `p` inside them
foreach ($divs as $idx => $d) {
    $dcnt[$idx] = (int) $xpath->evaluate('count(.//p)', $d);
}
// Show content of the first div with the max() count
$max = $dcnt ? max($dcnt) : 0;
foreach ($divs as $idx => $d) {
    if ($dcnt[$idx] == $max) {
        print $idx . ': ' . $d->nodeName . ': ' . $d->nodeValue;
        break;
    }
}
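For real pages, which are rarely well-formed XML and often carry inline scripts and styles, the same idea can be wrapped as below. This is only a sketch under assumptions: the function name extractPostText and the sample markup are made up for illustration, the title is assumed to contain no double quote, and the "best" div is taken to be the first sibling of the matching h1 with the most p descendants.

```php
<?php
// Hypothetical helper (not from the answer above): returns the text of the
// sibling div after the matching <h1> that holds the most <p> descendants.
function extractPostText(string $html, string $title): ?string
{
    $doc = new DOMDocument();
    // Real pages are rarely valid markup, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Drop <script> and <style> so their code does not leak into nodeValue.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Sibling divs of the <h1> whose trimmed text equals the known title.
    // Assumption: $title contains no double quote.
    $divs = $xpath->query(
        '//h1[normalize-space(.) = "' . $title . '"]/following-sibling::div'
    );

    $best = null;
    $bestCount = 0;
    foreach ($divs as $div) {
        $count = (int) $xpath->evaluate('count(.//p)', $div);
        if ($count > $bestCount) { // strict ">" keeps the first div on ties
            $best = $div;
            $bestCount = $count;
        }
    }

    return $best === null ? null : trim($best->nodeValue);
}

// Made-up sample: the second inner div has the most <p> elements.
$sample = '<body><div><h1> a dummy title </h1>'
    . '<script>var x = 1;</script>'
    . '<div><p>one</p></div>'
    . '<div><p>two</p><script>var y = 2;</script><p>three</p></div>'
    . '</div></body>';

echo extractPostText($sample, 'a dummy title'), "\n";
```

Matching the h1 by its known text narrows the search to the divs next to the title, which is what keeps unrelated divs with many p elements elsewhere on the page out of the result.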
CodePudding user response:
XPath 2.0 solution (see SaxonC for PHP support). Find the first div after the <h1> that has the maximum number of <p> children:
//h1/following-sibling::div[p[max(//h1/following-sibling::div/count(p))]][1]
Output:
<div >
<p>...</p>
<div >...</div>
<p>...</p>
<div >...</div>
<p>...</p>
<p>...</p>
</div>
Approximate XPath 1.0 solution (it could return the wrong div):
//h1/following-sibling::div[count(./p)>1][count(./p)>count(./preceding-sibling::div[./p][1]/p)][count(./p)>count(./following-sibling::div[./p][1]/p)][1]
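To see why this expression is only approximate, it can be run with PHP's DOMXPath (which implements XPath 1.0) and compared with a plain count done in PHP. The sketch below uses made-up markup whose divs hold 1, 3, 2 and 4 p children; the id attributes exist only to label the output and are not assumed to be present on real pages.

```php
<?php
// Made-up, well-formed sample: p counts per div are 1, 3, 2, 4.
$xml = '<body><div><h1>a dummy title</h1>'
    . '<div id="first"><p/></div>'
    . '<div id="second"><p/><p/><p/></div>'
    . '<div id="third"><p/><p/></div>'
    . '<div id="fourth"><p/><p/><p/><p/></div>'
    . '</div></body>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// The XPath 1.0 expression from above: it keeps divs with more p children
// than their nearest neighbouring divs, then takes the first such div.
$approx = '//h1/following-sibling::div[count(./p)>1]'
    . '[count(./p)>count(./preceding-sibling::div[./p][1]/p)]'
    . '[count(./p)>count(./following-sibling::div[./p][1]/p)][1]';
$approxId = $xpath->query($approx)->item(0)->getAttribute('id');

// Exact answer: XPath 1.0 has no max(), so track the maximum in PHP.
$exactId = null;
$max = 0;
foreach ($xpath->query('//h1/following-sibling::div') as $div) {
    $count = (int) $xpath->evaluate('count(p)', $div);
    if ($count > $max) { // strict ">" keeps the first div on ties
        $max = $count;
        $exactId = $div->getAttribute('id');
    }
}

echo "approximate: $approxId, exact: $exactId\n";
// prints "approximate: second, exact: fourth"
```

The expression selects the first local maximum (a div with more p children than both neighbours), so it agrees with the exact result only when that local maximum is also the global one.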