Drop the required surrounding quotes in the lexer/parser

In several projects I have run into the same effect in my grammars.

I need to parse something like Key="Value".

So I create a grammar (the simplest I could make to show the effect):

grammar test;

KEY   : [a-zA-Z0-9]+ ;
VALUE : DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE ;

DOUBLEQUOTE     : '"'           ;
EQUALS          : '='           ;

entry    : key=KEY EQUALS value=VALUE;

I can now parse thing="One Two Three", and in my code I receive:

key = thing
value = "One Two Three"

In all of my projects I end up with an extra step to strip those " from the value.

Usually something like this (I use Java):

String value = ctx.value.getText();
value = value.substring(1, value.length()-1);
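
For comparison, the manual step above can be wrapped in a small guard so that it does not throw on values that are empty or not quoted. This is only a sketch in plain Java; the class name QuoteStripper and the method name stripQuotes are hypothetical and not part of ANTLR or the original code.

```java
// Hypothetical helper; QuoteStripper and stripQuotes are illustrative names only.
final class QuoteStripper {
    private QuoteStripper() {}

    /** Returns text without one pair of surrounding double quotes, if both are present. */
    static String stripQuotes(String text) {
        if (text != null && text.length() >= 2
                && text.charAt(0) == '"'
                && text.charAt(text.length() - 1) == '"') {
            return text.substring(1, text.length() - 1);
        }
        return text; // null, too short, or not quoted: return unchanged
    }

    public static void main(String[] args) {
        System.out.println(stripQuotes("\"One Two Three\"")); // prints One Two Three
    }
}
```

It would be called as stripQuotes(ctx.value.getText()) in place of the two lines above.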

In my real grammars I find it very hard to move the check for the surrounding " into the parser.

Is there a clean way to drop the " already in the lexer/parser?

Essentially, I want ctx.value.getText() to return One Two Three instead of "One Two Three".

Update:

I have been playing with the excellent answer provided by Bart Kiers and found a variation that does exactly what I was looking for. Putting the DOUBLEQUOTE tokens on the hidden channel means the lexer still uses them to switch modes, but the parser never sees them.

TestLexer.g4

lexer grammar TestLexer;

KEY         : [a-zA-Z0-9]+ ;
DOUBLEQUOTE : '"' -> channel(HIDDEN), pushMode(STRING_MODE);
EQUALS      : '=';

mode STRING_MODE;

  STRING_DOUBLEQUOTE
   : '"' -> channel(HIDDEN), type(DOUBLEQUOTE), popMode
   ;

  STRING
   : [ _a-zA-Z0-9.-]+
   ;

and

TestParser.g4

parser grammar TestParser;

options { tokenVocab=TestLexer; }

entry : key=KEY EQUALS value=STRING ;
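
To picture what the hidden channel does, here is a plain-Java sketch that does not use ANTLR at all. It assumes a hand-written token list standing in for the lexer output for thing="One Two Three" and keeps only the tokens on the default channel, which is what the parser gets to see. Tok, ChannelDemo, and sampleTokens are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only: Tok and ChannelDemo are hypothetical names, not ANTLR classes.
final class ChannelDemo {
    static final int DEFAULT_CHANNEL = 0; // same numeric value ANTLR uses for its default channel
    static final int HIDDEN_CHANNEL = 1;  // same numeric value ANTLR uses for its hidden channel

    static final class Tok {
        final String type;
        final String text;
        final int channel;

        Tok(String type, String text, int channel) {
            this.type = type;
            this.text = text;
            this.channel = channel;
        }
    }

    /** Returns the tokens on the default channel, i.e. the ones a parser would be handed. */
    static List<Tok> visibleToParser(List<Tok> all) {
        List<Tok> visible = new ArrayList<>();
        for (Tok t : all) {
            if (t.channel == DEFAULT_CHANNEL) {
                visible.add(t);
            }
        }
        return visible;
    }

    /** Hand-written stand-in for the intended lexer output for thing="One Two Three". */
    static List<Tok> sampleTokens() {
        return Arrays.asList(
                new Tok("KEY", "thing", DEFAULT_CHANNEL),
                new Tok("EQUALS", "=", DEFAULT_CHANNEL),
                new Tok("DOUBLEQUOTE", "\"", HIDDEN_CHANNEL),
                new Tok("STRING", "One Two Three", DEFAULT_CHANNEL),
                new Tok("DOUBLEQUOTE", "\"", HIDDEN_CHANNEL));
    }

    public static void main(String[] args) {
        for (Tok t : visibleToParser(sampleTokens())) {
            System.out.println(t.type + " -> " + t.text);
        }
    }
}
```

Only KEY, EQUALS, and STRING survive the filter, which is why the parser rule can be written without any DOUBLEQUOTE in it.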

Answer:

Try this:

VALUE
 : DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE
   {setText(getText().substring(1, getText().length()-1));}
 ;

Needless to say, this ties your grammar to Java, and (depending on how much embedded Java code you have) it will be hard to port to another target language.

EDIT

Once a token is created, there is no built-in way to split it (other than in embedded actions, as demonstrated above). What you're looking for can be done, but it means rewriting your grammar so that a string literal is no longer a single token. Lexical modes make this possible: the lexer emits the quotes and the string contents as separate tokens, and the parser assembles the string from them.

A quick demo:

TestLexer.g4

lexer grammar TestLexer;

KEY         : [a-zA-Z0-9]+ ;
DOUBLEQUOTE : '"' -> pushMode(STRING_MODE);
EQUALS      : '=';

mode STRING_MODE;

  STRING_DOUBLEQUOTE
   : '"' -> type(DOUBLEQUOTE), popMode
   ;

  STRING_ATOM
   : [ _a-zA-Z0-9.-]
   ;

TestParser.g4

parser grammar TestParser;

options { tokenVocab=TestLexer; }

entry : key=KEY EQUALS value;

value : DOUBLEQUOTE string_atoms DOUBLEQUOTE;

string_atoms : STRING_ATOM*;

If you now run the Java code:

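import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;

Lexer lexer = new TestLexer(CharStreams.fromString("Key=\"One Two Three\""));
TestParser parser = new TestParser(new CommonTokenStream(lexer));

// getText() on string_atoms joins only the STRING_ATOM tokens, so the quotes are excluded
TestParser.EntryContext entry = parser.entry();
System.out.println(entry.value().string_atoms().getText());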

this will be printed:

One Two Three
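
To make the mode switching more concrete, here is a rough plain-Java imitation of what the lexer grammar above does. It is a hand-rolled sketch and not ANTLR's implementation: ModeLexerSketch and its methods are hypothetical names, tokens are represented as plain strings, and consecutive string characters are grouped into a single STRING_ATOM for brevity. A stack of modes decides which rules apply, and a quote either pushes or pops the string mode.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hand-rolled illustration of pushMode/popMode; not how ANTLR is implemented internally.
final class ModeLexerSketch {
    enum Mode { DEFAULT, STRING }

    static boolean isKeyChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Mirrors the character set [ _a-zA-Z0-9.-] allowed inside the quotes.
    static boolean isStringChar(char c) {
        return isKeyChar(c) || c == ' ' || c == '_' || c == '.' || c == '-';
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            boolean inString = modes.peek() == Mode.STRING;
            if (c == '"') {
                // Opening quote pushes the string mode, closing quote pops it.
                if (inString) { modes.pop(); } else { modes.push(Mode.STRING); }
                tokens.add("DOUBLEQUOTE");
                i++;
            } else if (inString && isStringChar(c)) {
                int start = i;
                while (i < input.length() && isStringChar(input.charAt(i))) { i++; }
                tokens.add("STRING_ATOM(" + input.substring(start, i) + ")");
            } else if (!inString && c == '=') {
                tokens.add("EQUALS");
                i++;
            } else if (!inString && isKeyChar(c)) {
                int start = i;
                while (i < input.length() && isKeyChar(input.charAt(i))) { i++; }
                tokens.add("KEY(" + input.substring(start, i) + ")");
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at index " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Key=\"One Two Three\""));
    }
}
```

Because the quotes and the contents come out as separate tokens, the parser can pick out just the contents, which is what the string_atoms rule does.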