I am interested in detecting strings that contain "un-combined" or "dangling" combining characters. These are formally known as isolated combining characters.
An example of such a string would be "\u0303 hello", which starts with a COMBINING TILDE that is not actually combined with anything else.
Is there an algorithm for detecting such a thing?
It seems like I can search over the string looking for "combine-able" base characters, and reject any combining character that is not preceded by such a base character. But how do I know what characters are base characters? I imagine that there are also edge cases to worry about.
My objective is to reject such strings as invalid identifiers in a programming language that supports Unicode identifiers, but this could be useful for other text-processing tasks too.
Answer:
Unicode 14.0 definitions D50 (graphic character), D51 (base character), and D52 (combining character) seem relevant.
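Read against those definitions, a graphic character has General_Category L, M, N, P, S or Zs, a combining character has General_Category M, and a base character is any graphic character that is not a combining character. The following is only a minimal Scala sketch of that classification. It assumes `java.lang.Character.getType` is an acceptable source of General_Category data (the Unicode version it reflects depends on the JDK), and the object and method names are hypothetical, chosen for illustration.

```scala
object UnicodeClasses {
  // General categories that are NOT graphic under D50: Cc, Cf, Cs, Co, Cn, Zl, Zp.
  private val nonGraphicTypes: Set[Int] = Set(
    Character.CONTROL, Character.FORMAT, Character.SURROGATE,
    Character.PRIVATE_USE, Character.UNASSIGNED,
    Character.LINE_SEPARATOR, Character.PARAGRAPH_SEPARATOR
  ).map(_.toInt)

  // D52: a combining character has General_Category M (Mn, Mc or Me).
  def isCombining(codePoint: Int): Boolean = {
    val t = Character.getType(codePoint)
    t == Character.NON_SPACING_MARK ||
    t == Character.COMBINING_SPACING_MARK ||
    t == Character.ENCLOSING_MARK
  }

  // D50: a graphic character has General_Category L, M, N, P, S or Zs.
  def isGraphic(codePoint: Int): Boolean =
    !nonGraphicTypes.contains(Character.getType(codePoint))

  // D51: a base character is a graphic character that is not a combining mark.
  def isBase(codePoint: Int): Boolean =
    isGraphic(codePoint) && !isCombining(codePoint)
}
```

For example, `UnicodeClasses.isBase('a')` and `UnicodeClasses.isBase(' ')` should hold, while U+0303 is combining and U+001F (a control character) is neither base nor combining.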
You can find the first isolated combining character in an uninterrupted run of (possibly several) isolated combining characters by searching for a combining character (M) that is not immediately preceded by a Letter (L), Number (N), Punctuation (P), Symbol (S), Space Separator (Zs), or another combining character (M). This includes a combining character at the very start of the string.
In Java regex syntax, that would be:
(?<!\p{L}|\p{N}|\p{P}|\p{S}|\p{Zs}|\p{M})\p{M}
Full runnable example (Scala):
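// A mark (M) not preceded by a base character (L, N, P, S, Zs) or another mark.
val rgx = """(?<!\p{L}|\p{N}|\p{P}|\p{S}|\p{Zs}|\p{M})\p{M}""".r
val examples = List(
  "\u0303bad",        // mark at the start of the string
  "ok\u0303",         // mark after a letter
  "ok\u0303\u0303",   // second mark follows another mark
  "bad\u001F\u0303"   // mark after a control character (Cc)
)
for (e <- examples) {
  // true means the string contains an isolated combining character
  println(rgx.findFirstIn(e).nonEmpty)
}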
prints:
true
false
false
true
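For the identifier use case in the question, it may help to report where the offending character is rather than only whether one exists. The sketch below wraps the same regex in a hypothetical `checkIdentifier` helper (the object name, method name and return shape are assumptions for illustration, not part of any library). Note that the reported offset is a UTF-16 code unit index, as returned by the regex match, and that this check covers only isolated combining characters, not the other rules an identifier grammar would impose.

```scala
import scala.util.matching.Regex

object IdentifierCheck {
  // Same pattern as above: a mark (M) not preceded by L, N, P, S, Zs or M.
  val isolatedMark: Regex =
    """(?<!\p{L}|\p{N}|\p{P}|\p{S}|\p{Zs}|\p{M})\p{M}""".r

  // Hypothetical helper: Left(offset) of the first isolated combining
  // character, or Right(s) when the string contains none.
  def checkIdentifier(s: String): Either[Int, String] =
    isolatedMark.findFirstMatchIn(s) match {
      case Some(m) => Left(m.start)
      case None    => Right(s)
    }
}
```

With the examples above, `IdentifierCheck.checkIdentifier("\u0303bad")` should yield `Left(0)` and `IdentifierCheck.checkIdentifier("ok\u0303")` should yield a `Right`.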