Scrape sibling tags and associate as a parent-child relationship

Time:11-05

I want to extract content from two different tags using PHP. I want to associate each h1 tag with the content of the div tags that follow it (up to the next h1), like a parent-child relationship.

<h1>Title 1</h1>
<div >some data and divs here 1</div>
<h1>Title 2</h1>
<div >some data and divs here 2</div>
<div >some data and divs here 3</div>
<h1>Title 3</h1>
<div >some data and divs here 4</div>
<div >some data and divs here 5</div>
<div >some data and divs here 6</div>

The number of div elements between two consecutive h1 tags varies.

I know how to scrape all tags of one kind with simple_html_dom or Goutte\Client to get:

<h1>Title 1</h1>
<h1>Title 2</h1>
<h1>Title 3</h1>

Or

<div >some data and divs here 1</div>
<div >some data and divs here 2</div>
<div >some data and divs here 3</div>
<div >some data and divs here 4</div>
<div >some data and divs here 5</div>
<div >some data and divs here 6</div>

But I am unable to associate the title to the data. I cannot figure out how to have an array like this:

array (
  0 => 
  array (
    'item' => 'Title 1',
    'data' => 'some data and divs here 1',
  ),
  1 => 
  array (
    'item' => 'Title 2',
    'data' => 'some data and divs here 2',
  ),
  2 => 
  array (
    'item' => 'Title 2',
    'data' => 'some data and divs here 3',
  ),
  3 => 
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 4',
  ),
  4 => 
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 5',
  ),
  5 => 
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 6',
  ),
)

I've tried to implement something based on sibling traversal, but couldn't find a way.

CodePudding user response:

Based on the answer to "XPath until next tag", I made a few small modifications to generate the desired result.

Code:

// $html is assumed to hold the markup from the question.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// loadHTML() wraps the fragment in <html><body>, hence the path below.
$domNodeList = $xpath->query('/html/body/h1');

$result = [];
foreach ($domNodeList as $element) {
    // Save the h1 text
    $item = $element->nodeValue;

    // Walk the following siblings until the next h1
    while ($element = $element->nextSibling) {
        if ($element->nodeName === "h1") {
            break;
        }
        // Skip text nodes (whitespace between tags); keep only elements
        if ($element->nodeType === XML_ELEMENT_NODE) {
            $result[] = ['item' => $item, 'data' => $element->nodeValue];
        }
    }
}
var_export($result);
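
As a variation on the loop above, the association can also be expressed from the div side: for each div, ask XPath for its nearest preceding h1 sibling. This is only a sketch, assuming the h1 and div elements are siblings at the same level as in the question; the shortened $html sample is made up for illustration.

```php
<?php
// Shortened sample markup (assumption: same flat structure as in the question).
$html = '<h1>Title 1</h1>
<div>some data and divs here 1</div>
<h1>Title 2</h1>
<div>some data and divs here 2</div>
<div>some data and divs here 3</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$result = [];
// Select every div that has at least one h1 before it at the same level.
foreach ($xpath->query('//div[preceding-sibling::h1]') as $div) {
    // preceding-sibling is a reverse axis, so [1] is the closest h1 before this div.
    $title = $xpath->evaluate('string(preceding-sibling::h1[1])', $div);
    $result[] = ['item' => $title, 'data' => $div->nodeValue];
}
var_export($result);
```

This trades the manual sibling walk for one extra XPath evaluation per div, which may be easier to read but is not necessarily faster on large documents.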

CodePudding user response:

Here's an idea: use some string manipulation to wrap the parts between the h1 tags in a span (for example), then load the result with PHP's DOMDocument and read the text by tag name (h1 and span).

Here's my attempt:

$html = '<h1>Title 1</h1>
<div >some data and divs here 1</div>
<h1>Title 2</h1>
<div >some data and divs here 2</div>
<div >some data and divs here 3</div>
<h1>Title 3</h1>
<div >some data and divs here 4</div>
<div >some data and divs here 5</div>
<div >some data and divs here 6</div>';

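// Open a span after every h1 and close it before the next one
$html = str_replace('</h1>', '</h1><span>', $html);
$html = str_replace('<h1>', '</span><h1>', $html);
// Balance the stray </span> at the start and the open <span> at the end
$html = "<span>$html</span>";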

$xml = new DOMDocument();
$xml->loadHTML($html);

$items = array();
foreach($xml->getElementsByTagName('span') as $item) {
    $items[] = trim($item->nodeValue);
}
array_shift($items);  // drop the empty span that precedes the first h1

$titles = array();
foreach($xml->getElementsByTagName('h1') as $title) {
    $titles[] = trim($title->nodeValue);
}

Output for $items and $titles:

Array
(
    [0] => some data and divs here 1
    [1] => some data and divs here 2
some data and divs here 3
    [2] => some data and divs here 4
some data and divs here 5
some data and divs here 6
)
Array
(
    [0] => Title 1
    [1] => Title 2
    [2] => Title 3
)
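
The two arrays above still have to be paired up to reach the structure asked for in the question. One possible sketch follows, assuming each div's text sits on its own line (as in the output shown), so that splitting every $items entry on line breaks recovers the individual divs. The arrays are hard-coded here to keep the example self-contained.

```php
<?php
// Values copied from the output above; in practice they come from the DOM loops.
$titles = ['Title 1', 'Title 2', 'Title 3'];
$items = [
    "some data and divs here 1",
    "some data and divs here 2\nsome data and divs here 3",
    "some data and divs here 4\nsome data and divs here 5\nsome data and divs here 6",
];

$result = [];
foreach ($titles as $i => $title) {
    // \R matches any line-break sequence; one line is assumed to be one div.
    foreach (preg_split('/\R/', $items[$i]) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $result[] = ['item' => $title, 'data' => $line];
        }
    }
}
var_export($result);
```

If a div can contain line breaks of its own, this split would break it apart, and walking the DOM siblings directly is the safer route.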