I want to extract content from two different tags using PHP. I want to associate each h1 tag with the content of the div tags that follow it (up to the next h1), like a parent-child relationship.
<h1>Title 1</h1>
<div >some data and divs here 1</div>
<h1>Title 2</h1>
<div >some data and divs here 2</div>
<div >some data and divs here 3</div>
<h1>Title 3</h1>
<div >some data and divs here 4</div>
<div >some data and divs here 5</div>
<div >some data and divs here 6</div>
The number of div elements between two h1 tags varies.
I know how to scrape all tags with simple_html_dom or Goutte\Client to get:
<h1>Title 1</h1>
<h1>Title 2</h1>
<h1>Title 3</h1>
Or
<div >some data and divs here 1</div>
<div >some data and divs here 2</div>
<div >some data and divs here 3</div>
<div >some data and divs here 4</div>
<div >some data and divs here 5</div>
<div >some data and divs here 6</div>
But I am unable to associate the title to the data. I cannot figure out how to have an array like this:
array (
  0 =>
  array (
    'item' => 'Title 1',
    'data' => 'some data and divs here 1',
  ),
  1 =>
  array (
    'item' => 'Title 2',
    'data' => 'some data and divs here 2',
  ),
  2 =>
  array (
    'item' => 'Title 2',
    'data' => 'some data and divs here 3',
  ),
  3 =>
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 4',
  ),
  4 =>
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 5',
  ),
  5 =>
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 6',
  ),
)
I've tried to implement something based on sibling traversal, but didn't find a way.
CodePudding user response:
Based on the answer to "XPath until next tag", I made a few small modifications to generate the desired result.
Code:
// $html holds the markup from the question
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// loadHTML() wraps the fragment in <html><body>, hence the absolute path
$domNodeList = $xpath->query('/html/body/h1');
$result = [];
foreach ($domNodeList as $element) {
    // Save the h1 text
    $item = $element->nodeValue;
    // Loop over the siblings until the next h1
    while ($element = $element->nextSibling) {
        if ($element->nodeName === "h1") {
            break;
        }
        // Only collect element nodes; skip whitespace text nodes
        if ($element->nodeType === XML_ELEMENT_NODE) {
            $result[] = ['item' => $item, 'data' => $element->nodeValue];
        }
    }
}
var_export($result);
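The same pairing can also be approached from the div side, letting XPath find the nearest preceding h1 for each div. This is only a sketch, assuming the h1 and div elements are direct children of body as in the question's sample (the sample markup is hard-coded so the snippet runs on its own):

```php
<?php
// Sample markup copied from the question.
$html = <<<HTML
<h1>Title 1</h1>
<div>some data and divs here 1</div>
<h1>Title 2</h1>
<div>some data and divs here 2</div>
<div>some data and divs here 3</div>
<h1>Title 3</h1>
<div>some data and divs here 4</div>
<div>some data and divs here 5</div>
<div>some data and divs here 6</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$result = [];
// Select every top-level div that has at least one h1 somewhere before it.
foreach ($xpath->query('/html/body/div[preceding-sibling::h1]') as $div) {
    // preceding-sibling is a reverse axis, so [1] is the closest h1 before this div.
    $h1 = $xpath->query('preceding-sibling::h1[1]', $div)->item(0);
    $result[] = ['item' => $h1->nodeValue, 'data' => $div->nodeValue];
}

var_export($result);
```

If the real page nests these elements deeper, the absolute path `/html/body/div` would need to be adjusted accordingly.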
CodePudding user response:
Here's an idea: use some string manipulation to wrap the parts between the h1 tags in a span (for example). Then read the result with PHP's DOMDocument, fetching the content by tag name (h1 and span).
Here's my attempt:
$html = '<h1>Title 1</h1>
<div >some data and divs here 1</div>
<h1>Title 2</h1>
<div >some data and divs here 2</div>
<div >some data and divs here 3</div>
<h1>Title 3</h1>
<div >some data and divs here 4</div>
<div >some data and divs here 5</div>
<div >some data and divs here 6</div>';
$html = str_replace('</h1>', '</h1><span>', $html);
$html = str_replace('<h1>', '</span><h1>', $html);
$html = "<span>$html</span>";
$xml = new DOMDocument();
$xml->loadHTML($html);
$items = array();
foreach ($xml->getElementsByTagName('span') as $item) {
    $items[] = trim($item->nodeValue);
}
array_shift($items); // ignore first
$titles = array();
foreach ($xml->getElementsByTagName('h1') as $title) {
    $titles[] = trim($title->nodeValue);
}
Output for $items and $titles:
Array
(
[0] => some data and divs here 1
[1] => some data and divs here 2
some data and divs here 3
[2] => some data and divs here 4
some data and divs here 5
some data and divs here 6
)
Array
(
[0] => Title 1
[1] => Title 2
[2] => Title 3
)
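To get from these two arrays to the flat item/data structure asked for in the question, one possible follow-up step is to split each block of text into lines and pair every line with the title at the same index. This is a rough sketch, assuming each div's text sits on its own line and contains no line breaks itself; the arrays are hard-coded from the output above so the snippet runs standalone:

```php
<?php
// Values as printed in the output above.
$titles = ['Title 1', 'Title 2', 'Title 3'];
$items = [
    "some data and divs here 1",
    "some data and divs here 2\nsome data and divs here 3",
    "some data and divs here 4\nsome data and divs here 5\nsome data and divs here 6",
];

$result = [];
foreach ($titles as $i => $title) {
    // Split the block on any line-break sequence; assumes one div per line.
    foreach (preg_split('/\R/', $items[$i]) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $result[] = ['item' => $title, 'data' => $line];
        }
    }
}

var_export($result);
```

Because the split relies on line breaks in the source markup, it is more fragile than walking the DOM siblings directly.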