Identifiers typically consist of underscores, digits; and uppercase and lowercase characters where the first character is not a digit. When writing lexers, it is common to have helper functions such as is_digit
or is_alnum
. If one were to implement such a function to scan a character used in an identifier, what would it be called? Clearly, is_identifier
is wrong as that would be the entire token that the lexer scans and not the individual character. I suppose is_alnum_or_underscore
would be accurate though quite verbose. For something as common as this, I feel like there should be a single word for it.
CodePudding user response:
Unicode Annex 31 (Unicode Identifier and Pattern Syntax, UAX31) defines a framework for the definition of the lexical syntax of identifiers, which is probably as close as we're going to come to a standard terminology. UAX31 is used (by reference) by Python and Rust, and has been approved for C 23. So I guess it's pretty well mainstream.
UAX31 defines three sets of identifier characters, which it calls Start, Continue and Medial. All Start characters are also Continue characters; no Medial character is a Continue character.
That leads to the simple regular expression (UAX31-D1 Default Identifier Syntax):
<Identifier> := <Start> <Continue>* (<Medial> <Continue> )*
A programming language which claims conformance with UAX31 does not need to accept the exact membership of each of these sets, but it must explicitly spell out the deviations in what's called a "profile". (There are seven other requirements, which are not relevant to this question. See the document if you want to fall down a very deep rabbit hole.)
That can be simplified even more, since neither UAX31 nor (as far as I know) the profile for any major language places any characters in Medial. So you can go with the flow and just define two categories: identifier-start and identifier-continue, where the first one is a subset of the second one.
You'll see that in a number of grammar documents:
So that's what I'd suggest. But there are many other possibilities: