Remove all nodes from XML but specific ones in PHP

Time:08-11

I have an XML feed from Google with content like this:

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
 <channel>
  <title>E-commerce's products.</title>
  <description><![CDATA[Clothing and accessories.]]></description>
  <link>https://www.ourwebsite.com/</link>
  <item>
   <title><![CDATA[Product #1 title]]></title>
   <g:brand><![CDATA[Product #1 brand]]></g:brand>
   <g:mpn><![CDATA[5643785645]]></g:mpn>
   <g:gender>Male</g:gender>
   <g:age_group>Adult</g:age_group>
   <g:size>Unica</g:size>
   <g:condition>new</g:condition>
   <g:id>fr_30763_06352</g:id>
   <g:item_group_id>fr_30763</g:item_group_id>
   <link><![CDATA[https://www.ourwebsite.com/product_1_url.htm?mid=62367]]></link>
   <description><![CDATA[Product #1 description]]></description>
   <g:image_link><![CDATA[https://data.ourwebsite.com/imgprodotto/product-1_big.jpg]]></g:image_link>
   <g:sale_price>29.25 EUR</g:sale_price>
   <g:price>65.00 EUR</g:price>
   <g:shipping_weight>0.5 kg</g:shipping_weight>
   <g:featured_product>y</g:featured_product>
   <g:product_type><![CDATA[Product #1 category]]></g:product_type>
   <g:availability>in stock</g:availability>
   <g:availability_date>2022-08-10T00:00-0000</g:availability_date>
   <qty>3</qty>
   <g:payment_accepted>Visa</g:payment_accepted>
   <g:payment_accepted>MasterCard</g:payment_accepted>
   <g:payment_accepted>CartaSi</g:payment_accepted>
   <g:payment_accepted>Aura</g:payment_accepted>
   <g:payment_accepted>PayPal</g:payment_accepted>
  </item>
  <item>
   <title><![CDATA[Product #2 title]]></title>
   <g:brand><![CDATA[Product #2 brand]]></g:brand>
   <g:mpn><![CDATA[573489547859]]></g:mpn>
   <g:gender>Unisex</g:gender>
   <g:age_group>Adult</g:age_group>
   <g:size>Unica</g:size>
   <g:condition>new</g:condition>
   <g:id>fr_47362_382936</g:id>
   <g:item_group_id>fr_47362</g:item_group_id>
   <link><![CDATA[https://www.ourwebsite.com/product_2_url.htm?mid=168192]]></link>
   <description><![CDATA[Product #2 description]]></description>
   <g:image_link><![CDATA[https://data.ourwebsite.com/imgprodotto/product-2_big.jpg]]></g:image_link>
   <g:sale_price>143.91 EUR</g:sale_price>
   <g:price>159.90 EUR</g:price>
   <g:shipping_weight>8.0 kg</g:shipping_weight>
   <g:product_type><![CDATA[Product #2 category]]></g:product_type>
   <g:availability>in stock</g:availability>
   <g:availability_date>2022-08-10T00:00-0000</g:availability_date>
   <qty>1</qty>
   <g:payment_accepted>Visa</g:payment_accepted>
   <g:payment_accepted>MasterCard</g:payment_accepted>
   <g:payment_accepted>CartaSi</g:payment_accepted>
   <g:payment_accepted>Aura</g:payment_accepted>
   <g:payment_accepted>PayPal</g:payment_accepted>
  </item>
  ...
 </channel>
</rss>

I need to produce an XML file in which every tag inside <item> is removed except <g:mpn>, <link>, <g:sale_price> and <qty>.

For the example above, the result should be:

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
 <channel>
  <title>E-commerce's products.</title>
  <description><![CDATA[Clothing and accessories.]]></description>
  <link>https://www.ourwebsite.com/</link>
  <item>
   <g:mpn><![CDATA[5643785645]]></g:mpn>
   <link><![CDATA[https://www.ourwebsite.com/product_1_url.htm?mid=62367]]></link>
   <g:sale_price>29.25 EUR</g:sale_price>
   <qty>3</qty>
  </item>
  <item>
   <g:mpn><![CDATA[573489547859]]></g:mpn>
   <link><![CDATA[https://www.ourwebsite.com/product_2_url.htm?mid=168192]]></link>
   <g:sale_price>143.91 EUR</g:sale_price>
   <qty>1</qty>
  </item>
  ...
 </channel>
</rss>

I've looked at the SimpleXML, DOMDocument and XPath docs, but I couldn't find a way to exclude specific elements. I don't want to select the nodes to delete by name, because Google could add new nodes in the future and my script would not delete them.

I've also tried to loop through namespaced elements with SimpleXML and unset them if not matched with the nodes I have to keep:

$g = $element->children($namespaces['g']); //$element is the SimpleXMLElement of <item> tag
foreach ($g as $gchild) {
    if ($gchild->getName() != "mpn") {  //for example
        unset($gchild);
    }
}

but the code above doesn't remove the other nodes and leave only <g:mpn>, as I expected.

PS: keep in mind that the XML contains both namespaced and non-namespaced elements.
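
The likely reason is that `unset($gchild)` only destroys the local loop variable and never touches the document itself. The sketch below shows the usual SimpleXML workaround on a cut-down sample feed: collect the unwanted nodes first, then remove each one with `unset($node[0])`. The `$xml` sample, the `$doomed` array and the `$out` variable are illustrative names, not part of the original script.

```php
<?php
// Cut-down sample feed (an assumption for illustration only).
$xml = <<<XML
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
 <channel>
  <item>
   <title>Product #1 title</title>
   <g:mpn>5643785645</g:mpn>
   <g:gender>Male</g:gender>
   <qty>3</qty>
  </item>
 </channel>
</rss>
XML;

$rss = simplexml_load_string($xml);
$g = 'http://base.google.com/ns/1.0';

foreach ($rss->channel->item as $item) {
    // Collect first: deleting while iterating can skip siblings.
    $doomed = [];
    foreach ($item->children($g) as $gchild) {
        if ($gchild->getName() !== 'mpn') {
            $doomed[] = $gchild;
        }
    }
    // unset($node[0]) removes the element itself from the document,
    // whereas unset($node) would only drop the PHP variable.
    foreach ($doomed as $node) {
        unset($node[0]);
    }
}

$out = $rss->asXML();
echo $out;
```

This pass handles only the `g:` children. The non-namespaced ones (`<title>` here) would need a second pass over `$item->children()`.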

Thank you in advance.

EDIT: I've managed to do this with the following code:

$elementsToKeep = array("mpn", "link", "sale_price", "qty");

$domdoc = new DOMDocument();
$domdoc->preserveWhiteSpace = FALSE;
$domdoc->formatOutput = TRUE;
$domdoc->loadXML($myXMLDocument->asXML());  //$myXMLDocument is the SimpleXML document related to the original XML
$xpath = new DOMXPath($domdoc);

foreach ($element->children() as $child) {
    $cname = $child->getName();
    if (!in_array($cname, $elementsToKeep)) {
        foreach($xpath->query('/rss/channel/item/'.$cname) as $node) {
            $node->parentNode->removeChild($node);
        }
    }
}

$g = $element->children($namespaces['g']);
foreach ($g as $gchild) {
    $gname = $gchild->getName();
    if (!in_array($gname, $elementsToKeep)) {
        foreach($xpath->query('/rss/channel/item/g:'.$gname) as $node) {
            $node->parentNode->removeChild($node);
        }
    }
}

I used DOMDocument and DOMXPath with two loops, one over the non-namespaced tags and one over the namespaced tags, so that I could call DOM's removeChild method on each unwanted node.

Is there really no cleaner solution? Thanks again.

CodePudding user response:

Somewhat simpler:

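// Reuses $xpath and $elementsToKeep from the code in the question.
$items = $xpath->query('//item');
foreach ($items as $item) {
    // All element descendants of this <item>, namespaced or not.
    $targets = $xpath->query('.//*', $item);
    foreach ($targets as $target) {
        // localName is the tag name without its prefix ("mpn" for <g:mpn>).
        if (!in_array($target->localName, $elementsToKeep)) {
            $target->parentNode->removeChild($target);
        }
    }
}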
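
For completeness, here is a self-contained sketch of the same idea that can be run on its own. It is a variant, not the exact code above. It assumes the keep list pairs each local name with its namespace URI (so a hypothetical `<g:link>` would not be mistaken for `<link>`), and it looks only at direct children of `<item>`. The function name `pruneItems`, the `"uri|name"` key format and the cut-down `$sample` feed are illustrative choices.

```php
<?php
// Hypothetical helper: removes every direct child of <item> that is not in $keep.
// $keep entries have the form "namespaceURI|localName"; non-namespaced tags use "|name".
function pruneItems(string $xml, array $keep): string
{
    $doc = new DOMDocument();
    $doc->preserveWhiteSpace = false;
    $doc->formatOutput = true;
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    // The XPath result is a static list, so removing nodes while looping is safe.
    foreach ($xpath->query('//item/*') as $child) {
        $key = $child->namespaceURI . '|' . $child->localName;
        if (!in_array($key, $keep, true)) {
            $child->parentNode->removeChild($child);
        }
    }
    return $doc->saveXML();
}

$g = 'http://base.google.com/ns/1.0';
$keep = ["$g|mpn", '|link', "$g|sale_price", '|qty'];

// Cut-down sample based on the feed in the question.
$sample = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
 <channel>
  <title>E-commerce's products.</title>
  <link>https://www.ourwebsite.com/</link>
  <item>
   <title><![CDATA[Product #1 title]]></title>
   <g:mpn><![CDATA[5643785645]]></g:mpn>
   <g:gender>Male</g:gender>
   <link><![CDATA[https://www.ourwebsite.com/product_1_url.htm?mid=62367]]></link>
   <g:sale_price>29.25 EUR</g:sale_price>
   <g:price>65.00 EUR</g:price>
   <qty>3</qty>
  </item>
 </channel>
</rss>
XML;

$result = pruneItems($sample, $keep);
echo $result;
```

Because the filter is a keep list, any tag Google adds later is dropped without changing the script. The channel-level `<title>` and `<link>` are untouched, since the query only matches children of `<item>`.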