Home > database >  How to get a quote-enclosed string and discard the quote itself
How to get a quote-enclosed string and discard the quote itself

Time:08-10

I would like to parse text enclosed in single quotes as just the text itself. For example:

"hello" --> hello

I'm able to parse the string with the quotes in antlr using the following rule:

grammar Test;

root
    : string EOF
    ;

string
    : QUOTE WORD QUOTE
    ;

WORD
    : [a-zA-Z0-9-] 
    ;

QUOTE
    : '"'
    ;

And with the input text "mrrogers":

enter image description here

However I'm not sure how to 'discard' the S_QUOTE values, I've tried doing the following two items and it seems like I'm on the wrong course:

fragment QUOTE
    : '\''
    ;

And:

QUOTE
    : '\'' -> skip
    ;

What would be the proper way to do this?

CodePudding user response:

These are very simplistic rules for a String type. How would you handle a string that should contain a ', for example. (Maybe you're string content is as simple as your word rule, but it looks more like this is just a starting point.

It will probably serve you well to take a look at the String rules in a grammar for a language with strings similar to what you want to allow. (You can find many grammars here. (It's pretty common to need to use Lexer modes to properly parse a String token)

I think you'll find that you need to capture the initial and terminal ' (or ") characters in the Lexer rule, so they will, necessarily be a part of the token. It's really trivial to strip thee first and last character from your token to get the string content from the token in your ParseTree.

CodePudding user response:

IMO it's better to just keep the quotes in the token and eliminate them at the stage where you need the text. This also allows you to handle special case, like the conversion of double-quotes (two consecutive quote char) to single quotes or handle escape codes (if this is something you want to support).

CodePudding user response:

If you separate your grammar into a Lexer grammar and a Parser grammar, you can use lexical modes to control which tokens are emitted for use by the parser.

As @MikeCargal has intimated, this is a simplistic solution but may help you see how your grammars may be structured.

TestLexer.g4

lexer grammar TestLexer;

tokens { WORD }

NEWLINE
    : [\n\r]
    ->channel(HIDDEN)
    ;

QUOTE
    : '"'
    ->pushMode(STRING_MODE),channel(HIDDEN)
    ;

mode STRING_MODE;

WORD
    : [a-zA-Z0-9-] 
    ;
    
STRING_MODE_QUOTE
    : '"'
    ->popMode,channel(HIDDEN)
    ;

The tokens { WORD } is an instruction to ANTLR to include the WORD token in the generated TestLexer.tokens file. This makes it visible to the parser.

The pushMode(STRING_MODE) is a way to have the lexer only emit tokens defined in the section mode STRING_MODE. ANTLR maintains a stack of modes, lexing starts out in DEFAULT_MODE and each pushMode() pushes a new mode onto the stack as the current mode governing which tokens will be emitted. Each popMode pops the current mode off the stack and the next one on the stack takes precedence.

TestParser.g4

parser grammar TestParser;

options { tokenVocab=TestLexer; }
root
    : string EOF
    ;
    
string
    : WORD
    ;
  • Related