Keep accented characters while highlighting text (wrapping in <span> tags)

I am using the following code to search for and highlight accented text. The problem is that it strips the accents from the text while highlighting. Is there any way to keep the accents?

echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");

function highlightTerm($text, $keyword) {
    $text = iconv('utf-8', 'ISO-8859-1//IGNORE', Normalizer::normalize($text, Normalizer::FORM_D));
    $words = explode(" ", $keyword);
    $p = implode('|', array_map('preg_quote', $words));
    return preg_replace(
        "/($p)/ui", 
        '<span style="background:yellow;">$1</span>', 
        $text
    );
}

CodePudding user response：

A simple replace will not work for this. You have to split the text into words and compare the normalized words. You should use DOM to iterate and replace the text nodes. This avoids replacing the terms inside other node types (attributes, comments, ...) and takes care of escaping.

Splitting could be done with a regular expression, but there is a specific tool for it in the ext/intl extension called IntlBreakIterator. The extension also has a Collator for string comparison.
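As a minimal sketch of those two tools in isolation (assuming ext/intl is installed, and reusing the 'en_US' locale from the example below), a Collator set to PRIMARY strength treats accented and unaccented spellings as equal, and the word breaker splits text into word, space, and punctuation parts:

```php
<?php
// Requires ext/intl. The sample strings are taken from the question.
$collator = new Collator('en_US');
// PRIMARY strength compares base letters only, ignoring accents and case.
$collator->setStrength(Collator::PRIMARY);

var_dump($collator->compare('café', 'cafe') === 0);   // bool(true)
var_dump($collator->compare('Kàpêk', 'kapek') === 0); // bool(true)
var_dump($collator->compare('café', 'cafes') === 0);  // bool(false)

// Split a piece of text into its parts (words, spaces, punctuation).
$breaker = IntlBreakIterator::createWordInstance('en_US');
$breaker->setText('a café, Mister');
$parts = iterator_to_array($breaker->getPartsIterator(), false);
print_r($parts);
```

Because the comparison happens on whole parts, the accented original can be kept for output while the match is decided on its base letters.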

Here is an example for whole words:

$html = <<<'HTML'
<div>
Would you like a café, Mister Kàpêk?
</div>
HTML;

// prepare the text breaker
$breaker = IntlBreakIterator::createWordInstance('en_US');
// prepare the compare
$collator = new Collator('en_US');
$collator->setStrength(Collator::PRIMARY);

// wrap terms for easy use
$terms = new Terms(
    function($word) use ($collator) {
        return $collator->getSortKey($word);
    },
    'cafe',
    'kapek'
);

// load HTML fragment into DOM
$document = new DOMDocument();
$document->loadHTML(
    "<?xml encoding='UTF-8'?>\n$html"
);
$xpath = new DOMXpath($document); 

// iterate text nodes
foreach ($xpath->evaluate('//text()') as $textNode) {
    // feed text into word breaker
    $breaker->setText($textNode->textContent);
    // prepare a fragment for new nodes
    $fragment = $document->createDocumentFragment();
    $replace = false; 
    // iterate words
    foreach ($breaker->getPartsIterator() as $word) {
        // find word in terms
        $index = $terms->indexOf($word) + 1;
        if ($index > 0) {
            $replace = true;
            // wrap in a "span" element
            $span = $document->createElement('span');
            $span->textContent = $word;
            $span->setAttribute('class', 'term');
            $span->setAttribute('data-term-index', $index);
            $fragment->appendChild($span);
        } else {
            $fragment->appendChild($document->createTextNode($word));
        }
    }
    if ($replace) {
        // replace original text node with new fragment
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
}

// DOMDocument::loadHTML() will have wrapped the HTML to 
// create a whole document
$result = '';
foreach ($xpath->evaluate('//body/node()') as $node) {
    $result .= $document->saveHTML($node);
}
echo $result;

class Terms {

    private $_normalize;    
    private $_hashes;
    
    public function __construct(
        callable $normalize, 
        string ...$terms
    ) {
        $this->_normalize = $normalize;
        $this->_hashes = array_flip(
            array_map(
                function(string $term): string { 
                   $normalize = $this->_normalize;
                   return $normalize($term);
                },
                $terms
            )
        );
    }
    
    public function indexOf(string $word): int {
       $normalize = $this->_normalize;
       $hash = $normalize($word);
       return $this->_hashes[$hash] ?? -1;
    }
}

Output:

<div>
Would you like a <span class="term" data-term-index="1">café</span>, Mister <span class="term" data-term-index="2">Kàpêk</span>?
</div>

Extending this to partial matches is possible, but it can get complex. You would have to simplify the current word (and keep track of the position) until it matches a term, then build the output fragment.
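One possible way to keep track of positions while simplifying a word is sketched here. The helpers simplify() and findPartial() are hypothetical names, not part of the answer above, and the sketch assumes ext/intl and ext/mbstring are available. It simplifies the word one character at a time and records which original character each simplified character came from, so a match in the simplified text can be mapped back to the accented original:

```php
<?php
// Requires ext/intl (Normalizer) and ext/mbstring.

// Hypothetical helper: strip combining marks and lowercase the text.
function simplify(string $text): string {
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    return mb_strtolower(preg_replace('/\p{Mn}+/u', '', $decomposed));
}

// Hypothetical helper: locate $term inside $word, ignoring accents and case.
// Returns [start, length] in characters of the NFC form of $word, or null.
function findPartial(string $word, string $term): ?array {
    $needle = simplify($term);
    if ($needle === '') {
        return null;
    }
    $word = Normalizer::normalize($word, Normalizer::FORM_C);
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $simplified = '';
    $origin = []; // simplified character index => original character index
    foreach ($chars as $i => $char) {
        $base = simplify($char);
        $simplified .= $base;
        for ($k = mb_strlen($base); $k > 0; $k--) {
            $origin[] = $i;
        }
    }
    $pos = mb_strpos($simplified, $needle);
    if ($pos === false) {
        return null;
    }
    $start = $origin[$pos];
    $end = $origin[$pos + mb_strlen($needle) - 1];
    return [$start, $end - $start + 1];
}

$word = Normalizer::normalize('Kàpêk', Normalizer::FORM_C);
[$start, $length] = findPartial($word, 'kape');
echo mb_substr($word, 0, $start)
    . '<span class="term">' . mb_substr($word, $start, $length) . '</span>'
    . mb_substr($word, $start + $length), "\n";
// <span class="term">Kàpê</span>k
```

In the DOM-based example, the three substrings would become a text node, a span element, and another text node appended to the fragment instead of being concatenated into a string.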

CodePudding user response：

Here is a not-so-pretty approach to isolate the search terms in the normalized input string, then perform multibyte-safe surgery on the original string based on the offsets of the matches and the lengths of substrings.

I replaced your pattern delimiters with a symbol that preg_quote() will escape by default.

The replacements must be done in reverse so that the offset and length calculations are not skewed.
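The effect of the replacement order can be seen in a small sketch. The helper wrapSpans() and its bracket markers are hypothetical and only stand in for the span tags; the offsets are computed once against the untouched string, as they are in the matching step:

```php
<?php
// Hypothetical helper: wrap each [offset, length] span of $text in brackets.
// The offsets always refer to the ORIGINAL, unmodified text.
function wrapSpans(string $text, array $spans, bool $reverse): string {
    if ($reverse) {
        $spans = array_reverse($spans);
    }
    foreach ($spans as [$offset, $length]) {
        $wrapped = '[' . substr($text, $offset, $length) . ']';
        $text = substr_replace($text, $wrapped, $offset, $length);
    }
    return $text;
}

$text  = 'one two three';
$spans = [[0, 3], [8, 5]]; // "one" and "three"

echo wrapSpans($text, $spans, true), "\n";  // [one] two [three]
echo wrapSpans($text, $spans, false), "\n"; // [one] tw[o thr]ee
```

Working from the last match to the first means every insertion happens to the right of the offsets that are still waiting to be processed, so none of them need to be recalculated.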

Normally this sort of task calls for preg_replace_callback(), but because the search is on the normalized string and the replacement is on the original string, the replacement step must be separated from the matching step.

I used strtr() to brute-force the normalization because I am not sure of the most reliable way to normalize accented characters. Feel free to replace that subprocess.
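Whatever normalization is used, the offset arithmetic relies on the normalized string having the same number of characters as the original. A minimal sketch, using a small hypothetical map rather than the full one below, shows that one-to-one replacements keep positions aligned while a multi-character replacement such as "ß" => "ss" shifts everything after it:

```php
<?php
// Small hypothetical map for illustration; not the full ACCENT_MAP.
$map = ['é' => 'e', 'à' => 'a', 'ß' => 'ss'];

$aligned = 'café au lait';
$shifted = 'Straße café';

// One-to-one replacements: the character counts match, so an offset found in
// the normalized string points at the same character in the original.
var_dump(mb_strlen($aligned) === mb_strlen(strtr($aligned, $map))); // bool(true)

// "ß" becomes two characters, so every later offset is off by one.
var_dump(mb_strlen($shifted));                      // int(11)
var_dump(mb_strlen(strtr($shifted, $map)));         // int(12)
var_dump(mb_strpos($shifted, 'café'));              // int(7)
var_dump(mb_strpos(strtr($shifted, $map), 'cafe')); // int(8)
```

Note also that preg_match_all() with PREG_OFFSET_CAPTURE reports byte offsets, so using them with mb_substr() is only safe when the normalized text before each match is plain ASCII.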

Code:

define(
    'ACCENT_MAP',
    [
        "ъ" => "-", "ь" => "-", "Ъ" => "-", "Ь" => "-",
        "А" => "A", "Ă" => "A", "Ǎ" => "A", "Ą" => "A", "À" => "A", "Ã" => "A", "Á" => "A", "Æ" => "A", "Â" => "A", "Å" => "A", "Ǻ" => "A", "Ā" => "A", "א" => "A",
        "Б" => "B", "ב" => "B", "Þ" => "B",
        "Ĉ" => "C", "Ć" => "C", "Ç" => "C", "Ц" => "C", "צ" => "C", "Ċ" => "C", "Č" => "C", "©" => "C", "ץ" => "C",
        "Д" => "D", "Ď" => "D", "Đ" => "D", "ד" => "D", "Ð" => "D",
        "È" => "E", "Ę" => "E", "É" => "E", "Ë" => "E", "Ê" => "E", "Е" => "E", "Ē" => "E", "Ė" => "E", "Ě" => "E", "Ĕ" => "E", "Є" => "E", "Ə" => "E", "ע" => "E",
        "Ф" => "F", "Ƒ" => "F",
        "Ğ" => "G", "Ġ" => "G", "Ģ" => "G", "Ĝ" => "G", "Г" => "G", "ג" => "G", "Ґ" => "G",
        "ח" => "H", "Ħ" => "H", "Х" => "H", "Ĥ" => "H", "ה" => "H",
        "I" => "I", "Ï" => "I", "Î" => "I", "Í" => "I", "Ì" => "I", "Į" => "I", "Ĭ" => "I", "I" => "I", "И" => "I", "Ĩ" => "I", "Ǐ" => "I", "י" => "I", "Ї" => "I", "Ī" => "I", "І" => "I",
        "Й" => "J", "Ĵ" => "J",
        "ĸ" => "K", "כ" => "K", "Ķ" => "K", "К" => "K", "ך" => "K",
        "Ł" => "L", "Ŀ" => "L", "Л" => "L", "Ļ" => "L", "Ĺ" => "L", "Ľ" => "L", "ל" => "L",
        "מ" => "M", "М" => "M", "ם" => "M",
        "Ñ" => "N", "Ń" => "N", "Н" => "N", "Ņ" => "N", "ן" => "N", "Ŋ" => "N", "נ" => "N", "ŉ" => "N", "Ň" => "N",
        "Ø" => "O", "Ó" => "O", "Ò" => "O", "Ô" => "O", "Õ" => "O", "О" => "O", "Ő" => "O", "Ŏ" => "O", "Ō" => "O", "Ǿ" => "O", "Ǒ" => "O", "Ơ" => "O",
        "פ" => "P", "ף" => "P", "П" => "P",
        "ק" => "Q",
        "Ŕ" => "R", "Ř" => "R", "Ŗ" => "R", "ר" => "R", "Р" => "R", "®" => "R",
        "Ş" => "S", "Ś" => "S", "Ș" => "S", "Š" => "S", "С" => "S", "Ŝ" => "S", "ס" => "S",
        "Т" => "T", "Ț" => "T", "ט" => "T", "Ŧ" => "T", "ת" => "T", "Ť" => "T", "Ţ" => "T",
        "Ù" => "U", "Û" => "U", "Ú" => "U", "Ū" => "U", "У" => "U", "Ũ" => "U", "Ư" => "U", "Ǔ" => "U", "Ų" => "U", "Ŭ" => "U", "Ů" => "U", "Ű" => "U", "Ǖ" => "U", "Ǜ" => "U", "Ǚ" => "U", "Ǘ" => "U",
        "В" => "V", "ו" => "V",
        "Ý" => "Y", "Ы" => "Y", "Ŷ" => "Y", "Ÿ" => "Y",
        "Ź" => "Z", "Ž" => "Z", "Ż" => "Z", "З" => "Z", "ז" => "Z",
        "а" => "a", "ă" => "a", "ǎ" => "a", "ą" => "a", "à" => "a", "ã" => "a", "á" => "a", "æ" => "a", "â" => "a", "å" => "a", "ǻ" => "a", "ā" => "a", "א" => "a",
        "б" => "b", "ב" => "b", "þ" => "b",
        "ĉ" => "c", "ć" => "c", "ç" => "c", "ц" => "c", "צ" => "c", "ċ" => "c", "č" => "c", "©" => "c", "ץ" => "c",
        "Ч" => "ch", "ч" => "ch",
        "д" => "d", "ď" => "d", "đ" => "d", "ד" => "d", "ð" => "d",
        "è" => "e", "ę" => "e", "é" => "e", "ë" => "e", "ê" => "e", "е" => "e", "ē" => "e", "ė" => "e", "ě" => "e", "ĕ" => "e", "є" => "e", "ə" => "e", "ע" => "e",
        "ф" => "f", "ƒ" => "f",
        "ğ" => "g", "ġ" => "g", "ģ" => "g", "ĝ" => "g", "г" => "g", "ג" => "g", "ґ" => "g",
        "ח" => "h", "ħ" => "h", "х" => "h", "ĥ" => "h", "ה" => "h",
        "i" => "i", "ï" => "i", "î" => "i", "í" => "i", "ì" => "i", "į" => "i", "ĭ" => "i", "ı" => "i", "и" => "i", "ĩ" => "i", "ǐ" => "i", "י" => "i", "ї" => "i", "ī" => "i", "і" => "i",
        "й" => "j", "Й" => "j", "Ĵ" => "j", "ĵ" => "j",
        "ĸ" => "k", "כ" => "k", "ķ" => "k", "к" => "k", "ך" => "k",
        "ł" => "l", "ŀ" => "l", "л" => "l", "ļ" => "l", "ĺ" => "l", "ľ" => "l", "ל" => "l",
        "מ" => "m", "м" => "m", "ם" => "m",
        "ñ" => "n", "ń" => "n", "н" => "n", "ņ" => "n", "ן" => "n", "ŋ" => "n", "נ" => "n", "ŉ" => "n", "ň" => "n",
        "ø" => "o", "ó" => "o", "ò" => "o", "ô" => "o", "õ" => "o", "о" => "o", "ő" => "o", "ŏ" => "o", "ō" => "o", "ǿ" => "o", "ǒ" => "o", "ơ" => "o",
        "פ" => "p", "ף" => "p", "п" => "p",
        "ק" => "q",
        "ŕ" => "r", "ř" => "r", "ŗ" => "r", "ר" => "r", "р" => "r", "®" => "r",
        "ş" => "s", "ś" => "s", "ș" => "s", "š" => "s", "с" => "s", "ŝ" => "s", "ס" => "s",
        "т" => "t", "ț" => "t", "ט" => "t", "ŧ" => "t", "ת" => "t", "ť" => "t", "ţ" => "t",
        "ù" => "u", "û" => "u", "ú" => "u", "ū" => "u", "у" => "u", "ũ" => "u", "ư" => "u", "ǔ" => "u", "ų" => "u", "ŭ" => "u", "ů" => "u", "ű" => "u", "ǖ" => "u", "ǜ" => "u", "ǚ" => "u", "ǘ" => "u",
        "в" => "v", "ו" => "v",
        "ý" => "y", "ы" => "y", "ŷ" => "y", "ÿ" => "y",
        "ź" => "z", "ž" => "z", "ż" => "z", "з" => "z", "ז" => "z", "ſ" => "z",
        "™" => "tm",
        "@" => "at",
        "Ä" => "ae", "Ǽ" => "ae", "ä" => "ae", "æ" => "ae", "ǽ" => "ae",
        "ĳ" => "ij", "Ĳ" => "ij",
        "я" => "ja", "Я" => "ja",
        "Э" => "je", "э" => "je",
        "ё" => "jo", "Ё" => "jo",
        "ю" => "ju", "Ю" => "ju",
        "œ" => "oe", "Œ" => "oe", "ö" => "oe", "Ö" => "oe",
        "щ" => "sch", "Щ" => "sch",
        "ш" => "sh", "Ш" => "sh",
        "ß" => "ss",
        "Ü" => "ue",
        "Ж" => "zh", "ж" => "zh",
    ]);

With:

function highlightTerm($text, $keyword) {
    $mbLength = mb_strlen($text);
    $unaccented = strtr($text, ACCENT_MAP);
    $words = explode(" ", $keyword);
    $regex = implode('|', array_map('preg_quote', $words));
    if (preg_match_all("#$regex#ui", $unaccented, $m, PREG_OFFSET_CAPTURE)) {
        foreach (array_reverse($m[0]) as [$match, $offset]) {

            // normalized length
            $length = strlen($match);

            // new multibyte-safe substring
            $tag = '<span style="background:yellow;">'
                . mb_substr($text, $offset, $length)
                . '</span>';

            // new substring's multibyte character length
            $tagLength = mb_strlen($tag);

            // actual multibyte-safe replacement on original text
            $text = mb_substr($text, 0, $offset)
                . $tag
                . mb_substr($text, $offset + $length);
        }
    }
    return $text;
}

echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");

Output:

Would you like a <span style="background:yellow;">caf</span>é, Mister <span style="background:yellow;">Kàpê</span>k?