How to capture a literal in antlr4?

I am looking to make a rule for a regex character class that is of the form:

 character_range
   : '[' literal '-' literal ']'
   ;

For example, with [1-5] I could match the string "1234543" but not "129". However, I'm having a hard time figuring out how to define a "literal" in antlr4. Normally I would use [a-zA-Z], but that covers only ASCII and won't include a character such as é. How would I do that?

CodePudding user response：

Actually, you don't want to match an entire literal, because a literal can be more than one character. Each end of the range only needs to be a single character.

In the parser:

character_range: OPEN_BRACKET LETTER DASH LETTER CLOSE_BRACKET;

And in the lexer:

OPEN_BRACKET: '[';
CLOSE_BRACKET: ']';
DASH: '-';
LETTER: [\p{L}];

The character class used in the LETTER lexer rule is the Unicode general category Letter (\p{L}), as described in ANTLR's Unicode documentation. Other general categories are listed in UAX #44, the annex that describes the Unicode Character Database. Note that \p{L} matches letters only, so a range such as [1-5] also needs Numbers (\p{N}). To cover all possible regex character classes you may need further categories such as Punctuation or Separators.
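Putting the pieces together, a minimal sketch of a complete grammar might look like the following. It assumes ANTLR 4.7 or later (where Unicode property escapes such as \p{L} are available in lexer character sets). The grammar name CharRange and the token name CHAR are hypothetical, chosen only for illustration; CHAR combines Letters and Numbers so that both [a-é] and [1-5] can be tokenized.

```antlr
grammar CharRange;

// Parser rule: '[' low '-' high ']', where each end is exactly one character
character_range
    : OPEN_BRACKET CHAR DASH CHAR CLOSE_BRACKET
    ;

OPEN_BRACKET  : '[';
CLOSE_BRACKET : ']';
DASH          : '-';

// One Unicode letter or number per token (assumes ANTLR 4.7+ property escapes)
CHAR          : [\p{L}\p{N}];
```

With this sketch, the input [1-5] should lex as OPEN_BRACKET CHAR DASH CHAR CLOSE_BRACKET. Checking that the first character sorts before the second, or applying the range to a string, is left to a listener or visitor.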

CodePudding user response：

You can also define an explicit range of Unicode code points. Try something like this in your lexer rules:

fragment LETTER: [a-zA-Z];
fragment LETTER_UNICODE: [\u0080-\uFFFF];

UTF8CHAR: ( LETTER | LETTER_UNICODE );
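The range above stops at U+FFFF, so characters outside the Basic Multilingual Plane are not covered. It also accepts every non-ASCII code point in that range, not only letters. A possible extension is sketched below, assuming ANTLR 4.7 or later, where the extended \u{...} escape is used for code points above U+FFFF. The rule name ANY_CHAR is hypothetical.

```antlr
fragment LETTER: [a-zA-Z];
// Assumption: ANTLR 4.7+; \u{...} is the escape form for code points above U+FFFF
fragment LETTER_UNICODE: [\u0080-\u{10FFFF}];

// Matches one ASCII letter or any single non-ASCII code point
ANY_CHAR: ( LETTER | LETTER_UNICODE );
```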