Searching an XML structure but modifying a node higher in the hierarchy

As an example, here is a minimal XML file:

<manifest xmlns="http://iuclid6.echa.europa.eu/namespaces/manifest/v1"
    xmlns:xlink="http://www.w3.org/1999/xlink">
    <general-information>
        <title>IUCLID 6 container manifest file</title>
        <created>Tue Nov 05 11:04:06 EET 2019</created>
        <author>SuperUser</author>
    </general-information>
    <base-document-uuid>f53d48a9-17ef-48f0-8d0e-76d03007bdfe/f53d48a9-17ef-48f0-8d0e-76d03007bdfe</base-document-uuid>
    <contained-documents>
        <document id="f53d48a9-17ef-48f0-8d0e-76d03007bdfe/f53d48a9-17ef-48f0-8d0e-76d03007bdfe">
            <type>DOSSIER</type>
            <name xlink:type="simple" 
                xlink:href="f53d48a9-17ef-48f0-8d0e-76d03007bdfe_f53d48a9-17ef-48f0-8d0e-76d03007bdfe.i6d"
                >Initial submission</name>
            <first-modification-date>2019-03-27T06:46:39Z</first-modification-date>
            <last-modification-date>2019-03-27T06:46:39Z</last-modification-date>
        </document>
    </contained-documents>
</manifest>

Here I want to find every xlink:href attribute and replace the name element that carries it with the contents of the file the attribute refers to, in this case f53d48a9-17ef-48f0-8d0e-76d03007bdfe_f53d48a9-17ef-48f0-8d0e-76d03007bdfe.i6d (which is also an XML file).

At the moment I use SimpleXML to load the file into an object and then the xml2json library to convert it into a nested array, but walking that array with the usual methods gives me no way to modify a parent node.

I'm not sure how to move back up the hierarchy. Any suggestions?

CodePudding user response：

This is where I am right now. The xmlToArray() function from xml2json (https://github.com/tamlyn/xml2json) returns an array of arrays, with the XML attributes brought out into the array as well.

<?php
include('./xml2json.php');

$xmlOptions = array(
    "namespaceRecursive" => true // boolean true, not the string "True"
);

// Recursively walk the converted array. Wherever an '@xlink:href' attribute
// is found, load the referenced XML file into a sibling 'content' key.
function i6cArray(array $array) {
    foreach ($array as $key => $value) {
        if (is_array($value)) {
            // recurse into the nested array and store the processed copy
            $array[$key] = i6cArray($value);
        } elseif ($key === '@xlink:href') {
            // strict comparison: in PHP 7, 0 == '@xlink:href' is true,
            // so integer keys would also match with a loose ==
            $tempxml = simplexml_load_file($value);
            if ($tempxml !== false) {
                // name.content = contents of the referenced file
                $array['content'] = xmlToArray($tempxml);
            }
        }
    }
    return $array;
}

if (file_exists('manifest.xml')) {
    $xml = simplexml_load_file('manifest.xml');
    if ($xml === false) {
        exit("Failed to parse manifest.");
    }
    $arrayData = xmlToArray($xml, $xmlOptions);

    // walk the array - we know the top level is an array
    $arrayData = i6cArray($arrayData);

    // output result
    $jsonString = json_encode($arrayData, JSON_PRETTY_PRINT);
    file_put_contents('dossier.json', $jsonString);
} else {
    exit("Failed to open manifest.");
}

?>

I would have liked to remove the @xlink attributes, but that isn't essential, so for now I insert a 'content' value holding the referenced XML content.

I would still like to have replaced the entire 'name' key with something.
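
One way to replace the whole 'name' entry in the array approach is to make the decision one level up: while looping over a parent's children, check whether a child array carries an '@xlink:href' key and, if it does, overwrite that child in the parent. The following is only a sketch under assumptions. replaceLinked() and the $loader callback are hypothetical names, the loader stands in for simplexml_load_file() plus xmlToArray(), and the sample array (including the '$' text key) is an assumed shape rather than real library output.

```php
<?php
// Hypothetical helper: walks a nested array and replaces every child that
// carries an '@xlink:href' key with whatever $loader returns for that href.
function replaceLinked(array $node, callable $loader): array
{
    foreach ($node as $key => $child) {
        if (!is_array($child)) {
            continue; // scalars (text, plain attributes) are left alone
        }
        if (isset($child['@xlink:href'])) {
            // We are in the parent here, so the whole child can be swapped out.
            $node[$key] = $loader($child['@xlink:href']);
        } else {
            $node[$key] = replaceLinked($child, $loader);
        }
    }
    return $node;
}

// Stand-in loader for illustration only; a real one might return
// xmlToArray(simplexml_load_file($href)).
$loader = function (string $href): array {
    return ['loaded-from' => $href];
};

// Assumed array shape, with a placeholder file name.
$data = [
    'document' => [
        'type' => 'DOSSIER',
        'name' => [
            '@xlink:type' => 'simple',
            '@xlink:href' => 'example.i6d',
            '$' => 'Initial submission',
        ],
    ],
];

$result = replaceLinked($data, $loader);
print_r($result);
```

Because the replacement happens in the parent's loop, the @xlink attributes disappear together with the old 'name' entry, so no separate clean-up step is needed.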

CodePudding user response：

A few bits of background before we get into the specific solution:

The parts of names before a colon are local aliases for a particular namespace, identified by a URI in an xmlns attribute. They need slightly different handling than non-namespaced names; see this reference question for SimpleXML.
PHP's SimpleXML and DOM extensions both have support for a language called "XPath", which lets you search for elements and attributes based on their parents and/or content.
The DOM is a more complex API than SimpleXML, but has more powerful features, particularly for writing. You can switch between the two using the functions simplexml_import_dom() and dom_import_simplexml().
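
To make these background points concrete, here is a small sketch that reads a namespaced attribute through SimpleXML and then moves the same node to DOM and back. The inline XML is a trimmed stand-in for the manifest, and the file name example.i6d and the 'checked' attribute are made up for illustration.

```php
<?php
// Inline sample standing in for manifest.xml (trimmed; href is a placeholder).
$xml = <<<XML
<manifest xmlns="http://iuclid6.echa.europa.eu/namespaces/manifest/v1"
    xmlns:xlink="http://www.w3.org/1999/xlink">
    <name xlink:type="simple" xlink:href="example.i6d">Initial submission</name>
</manifest>
XML;

$xlink = 'http://www.w3.org/1999/xlink';
$sx = simplexml_load_string($xml);

// Namespaced attributes are reached through attributes($namespaceUri).
$href = (string) $sx->name->attributes($xlink)->href;

// SimpleXML -> DOM: both objects wrap the same underlying node.
$domName = dom_import_simplexml($sx->name);
$domName->setAttribute('checked', 'yes'); // hypothetical marker attribute

// DOM -> SimpleXML: the change made through DOM is visible here too.
$back = simplexml_import_dom($domName);
$checked = (string) $back['checked'];

echo $href, "\n", $checked, "\n";
```

Since the two APIs share the document, you can search with whichever is more convenient and switch to DOM only for the write.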

In this case, we want to find all xlink:href attributes. Looking at the xmlns attributes at the top of the file, we see these are in the http://www.w3.org/1999/xlink namespace. In XPath, you can say "has an attribute" with the syntax [@attributename], so we can use SimpleXML and XPath like this:

$simplexml->registerXPathNamespace('xl', 'http://www.w3.org/1999/xlink');
$elements_with_xlink_hrefs = $simplexml->xpath('//*[@xl:href]');
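
As a quick check of that query, the sketch below runs it against an inline copy of a trimmed manifest. The document id and the file name example.i6d are placeholders, not values from the real file.

```php
<?php
$xml = <<<XML
<manifest xmlns="http://iuclid6.echa.europa.eu/namespaces/manifest/v1"
    xmlns:xlink="http://www.w3.org/1999/xlink">
    <contained-documents>
        <document id="doc-1">
            <type>DOSSIER</type>
            <name xlink:type="simple" xlink:href="example.i6d">Initial submission</name>
        </document>
    </contained-documents>
</manifest>
XML;

$simplexml = simplexml_load_string($xml);

// The prefix 'xl' is local to the XPath query; only the URI has to match
// the one declared in the document.
$simplexml->registerXPathNamespace('xl', 'http://www.w3.org/1999/xlink');

// '*' matches any element, whatever namespace it is in.
$elements_with_xlink_hrefs = $simplexml->xpath('//*[@xl:href]');

// Collect the local names of the matched elements.
$found = [];
foreach ($elements_with_xlink_hrefs as $element) {
    $found[] = $element->getName();
}
print_r($found);
```

Using * also sidesteps the manifest's default namespace, which would otherwise need its own registered prefix to match the name element by name.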

For each of those, we want the attribute value:

foreach ( $elements_with_xlink_hrefs as $simplexml_element ) {
    $filename = (string)$simplexml_element->attributes('http://www.w3.org/1999/xlink')->href;
    // ...

We then want to load that file and inject it into the document. This is easier with the DOM, with one complication: the node has to be "imported" so that it is "owned by" the right document, and the import has to be a deep one (second argument true) so that the child nodes come along.

    // load the other file
    $other_document = new DOMDocument;
    $other_document->load($filename);
    // switch to DOM and add it in place
    $dom_element = dom_import_simplexml($simplexml_element);
    $dom_element->appendChild(
        $dom_element->ownerDocument->importNode(
            $other_document->documentElement,
            true // deep import: without this only the empty root element is copied
        )
    );

We can now tidy up and delete the "xlink" attributes:

    $dom_element->removeAttributeNS('http://www.w3.org/1999/xlink', 'href');
    $dom_element->removeAttributeNS('http://www.w3.org/1999/xlink', 'type');

Once we're done, we can output the whole thing back as one combined XML document:

} // end of foreach loop
echo $simplexml->asXML();
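
The question asked for the name element itself to be replaced rather than extended. A variation on the steps above does that by calling replaceChild() on the element's parent. This is a sketch, not a tested drop-in: inlineLinkedDocuments() and the $baseDir parameter are hypothetical names, the demo uses throwaway files in a temporary directory, and the demo manifest leaves out the default namespace for brevity.

```php
<?php
const XLINK_NS = 'http://www.w3.org/1999/xlink';

// Replaces every element that has an xlink:href attribute with the root
// element of the file it points to. $baseDir is where hrefs are resolved.
function inlineLinkedDocuments(DOMDocument $doc, string $baseDir): void
{
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('xl', XLINK_NS);

    // Take a snapshot of the matches, because the loop changes the tree.
    $targets = iterator_to_array($xpath->query('//*[@xl:href]'));
    foreach ($targets as $element) {
        $href = $element->getAttributeNS(XLINK_NS, 'href');
        $linked = new DOMDocument();
        if (!@$linked->load($baseDir . '/' . $href)) {
            continue; // leave the element untouched if the file cannot be read
        }
        // true = deep copy, so the children of the linked root come along.
        $copy = $doc->importNode($linked->documentElement, true);
        // The change is made through the parent, which swaps the node out.
        $element->parentNode->replaceChild($copy, $element);
    }
}

// Demo with throwaway files (all names here are made up).
$dir = sys_get_temp_dir() . '/xlink_demo_' . uniqid();
mkdir($dir);
file_put_contents($dir . '/linked.xml', '<dossier><subject>Example</subject></dossier>');
file_put_contents(
    $dir . '/manifest.xml',
    '<manifest xmlns:xlink="' . XLINK_NS . '"><document>'
    . '<name xlink:type="simple" xlink:href="linked.xml">Initial submission</name>'
    . '</document></manifest>'
);

$doc = new DOMDocument();
$doc->load($dir . '/manifest.xml');
inlineLinkedDocuments($doc, $dir);
$output = $doc->saveXML();
echo $output;
```

Because the whole element is swapped out, the xlink attributes go with it and the removeAttributeNS() calls are not needed in this variant.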