What is a valid character in an identifier called?

Time:05-18

Identifiers typically consist of underscores, digits, and uppercase and lowercase letters, where the first character is not a digit. When writing lexers, it is common to have helper functions such as is_digit or is_alnum. If one were to implement such a function to test a single character used in an identifier, what would it be called? Clearly, is_identifier is wrong, as that would describe the entire token that the lexer scans and not the individual character. I suppose is_alnum_or_underscore would be accurate, though quite verbose. For something as common as this, I feel like there should be a single word for it.

CodePudding user response:

Unicode Standard Annex #31 (Unicode Identifier and Pattern Syntax, UAX31) defines a framework for the definition of the lexical syntax of identifiers, which is probably as close as we're going to come to a standard terminology. UAX31 is used (by reference) by Python and Rust, and has been approved for C++23. So I guess it's pretty well mainstream.

UAX31 defines three sets of identifier characters, which it calls Start, Continue and Medial. All Start characters are also Continue characters; no Medial character is a Continue character.

That leads to the simple regular expression (UAX31-D1 Default Identifier Syntax):

<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*

A programming language which claims conformance with UAX31 does not need to accept the exact membership of each of these sets, but it must explicitly spell out the deviations in what's called a "profile". (There are seven other requirements, which are not relevant to this question. See the document if you want to fall down a very deep rabbit hole.)
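
To make the Default Identifier Syntax concrete, here is a minimal sketch of it as a Python regular expression. The three character classes are a made-up ASCII profile chosen purely for illustration (in particular, treating "-" as a Medial character is my assumption, not something UAX31 or any of the languages mentioned here specifies), and the name is_valid_identifier is hypothetical.

```python
import re

# Hypothetical ASCII-only profile, for illustration only.
START = "[A-Za-z_]"        # characters allowed in first position
CONTINUE = "[A-Za-z0-9_]"  # characters allowed after the first
MEDIAL = "[-]"             # assumed: only allowed between Continue characters

# Start Continue* (Medial Continue+)*
IDENTIFIER = re.compile(f"{START}{CONTINUE}*(?:{MEDIAL}{CONTINUE}+)*")


def is_valid_identifier(text: str) -> bool:
    """Return True if the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None
```

With this profile, "foo-bar" is accepted, while "foo-", "-foo" and "foo--bar" are rejected, because every Medial character has to be followed by at least one Continue character.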

That can be simplified even more, since neither UAX31 nor (as far as I know) the profile for any major language places any characters in Medial. So you can go with the flow and just define two categories: identifier-start and identifier-continue, where the first one is a subset of the second one.
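
In a hand-written lexer, the two categories map directly onto two helper predicates. The following is only a sketch under the assumption of an ASCII-only language (letters, digits and underscore, as described in the question); the function names and the scan_identifier helper are my own choices, not taken from any of the grammars cited here.

```python
import string

# Assumed ASCII-only profile: Start is letters plus underscore,
# Continue is Start plus digits (so Start is a subset of Continue).
IDENTIFIER_START = frozenset(string.ascii_letters + "_")
IDENTIFIER_CONTINUE = IDENTIFIER_START | frozenset(string.digits)


def is_identifier_start(ch: str) -> bool:
    """True if ch may be the first character of an identifier."""
    return ch in IDENTIFIER_START


def is_identifier_continue(ch: str) -> bool:
    """True if ch may appear after the first character of an identifier."""
    return ch in IDENTIFIER_CONTINUE


def scan_identifier(text: str, pos: int = 0) -> int:
    """Return the index just past the identifier starting at pos.

    Returns pos unchanged if no identifier starts there.
    """
    if pos >= len(text) or not is_identifier_start(text[pos]):
        return pos
    end = pos + 1
    while end < len(text) and is_identifier_continue(text[end]):
        end += 1
    return end
```

For example, scan_identifier("foo_1 = 2") stops at index 5, just after "foo_1", while scan_identifier("1abc") returns 0 because a digit is not a Start character.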

You'll see that in a number of grammar documents:

Python:

    identifier   ::=  xid_start xid_continue*

Rust:

    IDENTIFIER_OR_KEYWORD : XID_Start XID_Continue*
                          | _ XID_Continue+

C++:

    identifier:
            identifier-start
            identifier identifier-continue

So that's what I'd suggest. But there are many other possibilities:

Swift: calls the sets identifier-head and identifier-characters.

Java: calls them JavaLetter and JavaLetterOrDigit.

C: defines identifier-nondigit and digit; Continue would be the union of the two sets.