What is a valid character in an identifier called?

Identifiers typically consist of underscores, digits, and uppercase and lowercase letters, where the first character is not a digit. When writing lexers, it is common to have helper functions such as is_digit or is_alnum. If one were to implement such a function to test a single character used in an identifier, what would it be called? Clearly, is_identifier is wrong, as that would describe the entire token that the lexer scans and not the individual character. I suppose is_alnum_or_underscore would be accurate, though quite verbose. For something as common as this, I feel like there should be a single word for it.

CodePudding user response：

Unicode Standard Annex #31 (Unicode Identifier and Pattern Syntax, UAX31) defines a framework for the lexical syntax of identifiers, which is probably as close as we're going to come to a standard terminology. UAX31 is used (by reference) by Python and Rust, and has been approved for C++23. So I guess it's pretty well mainstream.

UAX31 defines three sets of identifier characters, which it calls Start, Continue and Medial. All Start characters are also Continue characters; no Medial character is a Continue character.

That leads to the simple regular expression (UAX31-D1 Default Identifier Syntax):

<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
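
As a rough illustration, that pattern can be transcribed almost directly into a Python regular expression. The character classes here are assumptions made for the sketch, not the real UAX31 sets: ASCII letters and underscore stand in for Start, those plus digits for Continue, and a hypothetical Medial set containing only the hyphen is used so the last group has something to match. The name is_identifier is likewise just illustrative.

```python
import re

# Illustrative character classes (assumptions, not the real UAX31 sets).
START = r"[A-Za-z_]"
CONTINUE = r"[A-Za-z0-9_]"
MEDIAL = r"[-]"  # hypothetical; chosen only to exercise the Medial branch

# <Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
IDENTIFIER = re.compile(rf"{START}{CONTINUE}*(?:{MEDIAL}{CONTINUE}+)*")


def is_identifier(text: str) -> bool:
    """Return True if the whole string matches the pattern above."""
    return IDENTIFIER.fullmatch(text) is not None
```

With these stand-in sets, "kebab-case" is accepted but "trailing-" and "a--b" are not, because every Medial character must be followed by at least one Continue character.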

A programming language which claims conformance with UAX31 does not need to accept the exact membership of each of these sets, but it must explicitly spell out the deviations in what's called a "profile". (There are seven other requirements, which are not relevant to this question. See the document if you want to fall down a very deep rabbit hole.)

That can be simplified even more, since neither UAX31 nor (as far as I know) the profile for any major language places any characters in Medial. So you can go with the flow and just define two categories: identifier-start and identifier-continue, where the first one is a subset of the second one.
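
For the lexer helpers in the question, that naming might look something like the following sketch. It is deliberately ASCII-only: the two predicates are simplified stand-ins for the Start and Continue sets, not the Unicode properties themselves, and scan_identifier is a hypothetical helper added only to show how they would be used together.

```python
def is_identifier_start(ch: str) -> bool:
    """ASCII-only stand-in for the Start set: a letter or underscore."""
    return ch == "_" or (ch.isascii() and ch.isalpha())


def is_identifier_continue(ch: str) -> bool:
    """ASCII-only stand-in for the Continue set: Start plus digits."""
    return is_identifier_start(ch) or (ch.isascii() and ch.isdigit())


def scan_identifier(source: str, pos: int) -> int:
    """Return the index just past the identifier at pos, or pos if none starts there."""
    if pos >= len(source) or not is_identifier_start(source[pos]):
        return pos
    end = pos + 1
    while end < len(source) and is_identifier_continue(source[end]):
        end += 1
    return end
```

Because every Start character is also a Continue character, the scanning loop only needs the second predicate once the first character has been accepted.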

You'll see that in a number of grammar documents:

Python

identifier   ::=  xid_start xid_continue*

Rust

IDENTIFIER_OR_KEYWORD : XID_Start XID_Continue*
                      | _ XID_Continue+

C++

identifier:
        identifier-start
        identifier identifier-continue

So that's what I'd suggest. But there are many other possibilities:

Swift: Calls the sets identifier-head and identifier-characters
Java: Calls them JavaLetter and JavaLetterOrDigit
C: Defines identifier-nondigit and identifier-digit; Continue would be the union of the two sets.