Modifying DOM once causes subsequent modifications to error

Time:10-30

I'm trying to wrap all instances of certain phrases in a <span> using PHP's DOMDocument and XPath. I've based my logic on this answer from another post, but it only lets me wrap the first match within a node, when I need to wrap all matches.

Once I modify the DOM for the first match, subsequent iterations fail with Fatal error: Uncaught Error: Call to a member function splitText() on bool, at the line that begins with $after. I'm pretty sure this is caused by modifying the markup, but I've been unable to figure out why.

What am I doing wrong here?

/**
 * Automatically wrap various forms of CCJM in a class for branding purposes
 *
 * @link https://stackoverflow.com/a/6009594/654480
 *
 * @param string $content
 * @return string
 */
function ccjm_branding_filter(string $content): string {
    if (! (is_admin() && ! wp_doing_ajax()) && $content) {
        $DOM = new DOMDocument();

        /**
         * Use internal errors to get around HTML5 warnings
         */
        libxml_use_internal_errors(true);

        /**
         * Load in the content, with proper encoding and an `<html>` wrapper required for parsing
         */
        $DOM->loadHTML("<?xml encoding='utf-8' ?><html>{$content}</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

        /**
         * Clear errors to get around HTML5 warnings
         */
        libxml_clear_errors();

        /**
         * Initialize XPath
         */
        $XPath = new DOMXPath($DOM);

        /**
         * Retrieve all text nodes, except those within scripts
         */
        $text = $XPath->query("//text()[not(parent::script)]");

        foreach ($text as $node) {
            /**
             * Find all matches, including offset
             */
            preg_match_all("/(C\.? ?C\.?(?:JM| Johnson (?:&|&amp;|&#38;|and) Malhotra)(?: Engineers, LTD\.?|, P\.?C\.?)?)/i", $node->textContent, $matches, PREG_OFFSET_CAPTURE);

            /**
             * Wrap each match in appropriate span
             */
            foreach ($matches as $group) {
                foreach ($group as $key => $match) {
                    /**
                     * Determine the offset and the length of the match
                     */
                    $offset = $match[1];
                    $length = strlen($match[0]);

                    /**
                     * Isolate the match and what comes after it
                     */
                    $word  = $node->splitText($offset);
                    $after = $word->splitText($length);

                    /**
                     * Create the wrapping span
                     */
                    $span = $DOM->createElement("span");
                    $span->setAttribute("class", "__brand");

                    /**
                     * Replace the word with the span, and then re-insert the word within it
                     */
                    $word->parentNode->replaceChild($span, $word);
                    $span->appendChild($word);

                    break; // it always errors after the first loop
                }
            }
        }

        /**
         * Save changes, remove unneeded tags
         */
        $content = implode(array_map([$DOM->documentElement->ownerDocument, "saveHTML"], iterator_to_array($DOM->documentElement->childNodes)));
    }

    return $content;
}
add_filter("ccjm_final_output", "ccjm_branding_filter");

Example content (all instances of "C.C. Johnson & Malhotra, P.C." and "CCJM" are matched, but only the first can be successfully modified):

C.C. Johnson & Malhotra, P.C. (CCJM) was an integral member of a large Design Team for a 16.5-mile-long Public-Private Partnership (P3) Purple Line Project. The east-west light rail system extends from New Carrollton in PG County, MD to Bethesda in MO County, MD with 21 stations and one short tunnel. CCJM was Engineer of Record (EOR) for the design of eight (8) Bridges and design reviews for 35 transit/highway bridges and over 100 retaining walls of different lengths/types adjacent to bridges and in areas of cut/fill. CCJM designed utility structures for 42,000 LF of relocated water mains and 19,000 LF of relocated sewer mains meeting Washington Suburban Sanitary Commission (WSSC), Md Dept of Transportation (MDOT) MTA, and Local Standards.

EDIT 1: Doing some testing, when I output $node->textContent, I see that it changes after the first loop... so I think what's happening is that after I do $node->splitText($offset), it's actually updating the entire node, so subsequent offsets don't work.
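The behaviour described in EDIT 1 can be reproduced in isolation. The following is a minimal sketch, assuming a simplified input string ("CCJM and CCJM") and hard-coded offsets standing in for what preg_match_all() would report; neither comes from the original code. It shows that after the first splitText() call, the original node no longer contains the text that the second offset refers to.

```php
<?php
// Assumed, simplified input: one text node containing two matches.
$doc = new DOMDocument();
$doc->loadXML('<p>CCJM and CCJM</p>');
$node = $doc->documentElement->firstChild; // the single text node

// Offsets as they would be reported for the ORIGINAL string.
$offsets = [0, 9];

// First match: same two splits as in the question.
$word  = $node->splitText($offsets[0]); // $node keeps only the text before offset 0
$after = $word->splitText(4);           // $word = "CCJM", $after = " and CCJM"

// $node is now shorter than the second offset, so that offset is out of range.
printf("node length: %d, second offset: %d\n", $node->length, $offsets[1]);
printf("word: '%s', after: '%s'\n", $word->data, $after->data);
```

In other words, splitText() leaves the text before the offset in the node it was called on and moves the rest into a new sibling, so offsets measured against the original string stop lining up with $node.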

Answer:

First of all, I don't think foreach ($matches as $group) is correct here. If you check what $matches contains, it holds the same matches twice: $matches[0] has the full-pattern matches and $matches[1] has capture group 1, which in your regex wraps the entire pattern. You probably don't want to wrap each match in a span twice, so that outer foreach should be removed, and the inner one should iterate over $matches[0] only.

And second, I think your offset problem can be solved simply if you "mount the horse backwards": don't replace the matches from first to last, but in the opposite order. Then you will only ever be manipulating the structure "behind" the current position, so whatever changes occur there will not affect the positions of the earlier matches.
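The duplication is easy to confirm on a small input. This is only a sketch: the pattern is shortened to /(CCJM)/ and the sample sentence is made up, but the shape of $matches is the same as with the full regex, because there too a single capture group wraps the whole pattern.

```php
<?php
// Assumed, shortened pattern and sample text (not the original regex).
$text = 'CCJM was founded; CCJM designs bridges.';
preg_match_all('/(CCJM)/', $text, $matches, PREG_OFFSET_CAPTURE);

// $matches[0] = full-pattern matches, $matches[1] = capture group 1.
// The group spans the entire pattern, so both lists hold the same entries.
var_dump($matches[0] === $matches[1]);

// Each entry is a [matched string, byte offset] pair.
foreach ($matches[0] as [$matched, $offset]) {
    echo "{$offset}: {$matched}\n";
}
```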

        /**
         * Wrap each match in appropriate span, starting with the last match
         */
        foreach (array_reverse($matches[0]) as $match) {
            /**
             * Determine the offset and the length of the match
             */
            $offset = $match[1];
            $length = strlen($match[0]);

            /**
             * Isolate the match and what comes after it
             */
            $word  = $node->splitText($offset);
            $after = $word->splitText($length);

            /**
             * Create the wrapping span
             */
            $span = $DOM->createElement("span");
            $span->setAttribute("class", "__brand");

            /**
             * Replace the word with the span, and then re-insert the word within it
             */
            $word->parentNode->replaceChild($span, $word);
            $span->appendChild($word);
        }

The result I get with your sample input data is the following (live example: https://3v4l.org/kbSQ8):

<p><span class="__brand">C.C. Johnson &amp; Malhotra, P.C.</span> (<span
class="__brand">CCJM</span>) was an integral member of a large Design Team
for a 16.5-mile-long Public-Private Partnership (P3) Purple Line Project.
The east-west light rail system extends from New Carrollton in PG County,
MD to Bethesda in MO County, MD with 21 stations and one short tunnel.
<span class="__brand">CCJM</span> was Engineer of Record (EOR) for the
design of eight (8) Bridges and design reviews for 35 transit/highway
bridges and over 100 retaining walls of different lengths/types adjacent to
bridges and in areas of cut/fill. <span class="__brand">CCJM</span>
designed utility structures for 42,000 LF of relocated water mains and
19,000 LF of relocated sewer mains meeting Washington Suburban Sanitary
Commission (WSSC), Md Dept of Transportation (MDOT) MTA, and Local
Standards.</p>
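For trying the reverse-order approach outside WordPress, here is a self-contained sketch. The function name wrap_matches() and its parameters are hypothetical, not part of the original code. It also adds one precaution that the snippets above do not have: preg_match_all() reports byte offsets, while splitText() counts characters as far as I know, so the offsets are converted with mb_strlen() (this assumes the mbstring extension is available). For pure ASCII text the two values are the same.

```php
<?php
// Hypothetical helper (not from the original post): wraps every match of
// $pattern inside the text nodes of an HTML fragment in <span class="__brand">.
function wrap_matches(string $html, string $pattern): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    // Same loading approach as in the question: encoding hint plus <html> wrapper.
    $dom->loadHTML(
        "<?xml encoding='utf-8' ?><html>{$html}</html>",
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Snapshot the text nodes before the tree is modified.
    $nodes = iterator_to_array($xpath->query('//text()[not(parent::script)]'));

    foreach ($nodes as $node) {
        $text = $node->textContent;
        if (!preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE)) {
            continue;
        }
        // Last match first, so the offsets of earlier matches stay valid.
        foreach (array_reverse($matches[0]) as [$matched, $byteOffset]) {
            // Convert byte offset and length to character counts.
            $offset = mb_strlen(substr($text, 0, $byteOffset), 'UTF-8');
            $length = mb_strlen($matched, 'UTF-8');

            $word = $node->splitText($offset); // match plus everything after it
            $word->splitText($length);         // cut off everything after the match

            $span = $dom->createElement('span');
            $span->setAttribute('class', '__brand');
            $word->parentNode->replaceChild($span, $word);
            $span->appendChild($word);
        }
    }

    // Serialize the children of the wrapper, dropping the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo wrap_matches('<p>CCJM (CCJM) designs bridges.</p>', '/CCJM/'), PHP_EOL;
```

In the filter from the question, the full branding regex would be passed as $pattern; it needs no outer capture group here, since only $matches[0] is used.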