How to deal with invalid characters in a string

Time:08-16

Let's say I allow strings of the following form:

"hello"

And, within a string, I also allow hex escape sequences of the form \xAB, and so an example string might be:

'hello \x74\x6f\x6d'
# hello tom

Of course the following is an invalid string:

"hello \x7z"

Should this be handled at the lexing stage, so that the lexer raises a parsing error saying the string is not in the correct format? Or should it not matter at the lexing phase, where the only job is to "grab the string token", regardless of validity, and pass it on to the next phase? That phase could then do some basic checks that the string contents are in fact valid (and some other checks: maybe that it doesn't contain 0x00, or that it's under a certain length, etc.).

CodePudding user response:

Since there’s no “valid” way to interpret that input other than as a string with invalid content, you’re better off accepting it as a string and doing the escape sequence validation in a listener or visitor.

Trying to do things like this in the parser adds no real value, substantially complicates the parser rules, and likely results in really obtuse error messages for the user of your language.

The purpose of your grammar is to recognize that the user was attempting to create a string, and to create the parse tree indicating such. It’s really not much different from a user trying to assign to an undefined variable in a language that requires definitions. Your parser identifies the only possible intent, and then you can validate it and, hopefully, write good error messages describing the problem.
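
As a rough illustration of that split, here is a minimal Python sketch (not tied to any particular parser generator). It assumes double-quoted strings whose only legal escape is \x followed by two hex digits, as in the question. The names `lex_string`, `validate_string`, `decode_string`, and `StringError` are made up for this example. The lexer pattern is deliberately lenient, and a separate pass reports bad escapes with their position.

```python
import re
from dataclasses import dataclass

# Lenient token pattern: a double-quoted run where a backslash may be followed
# by any character. It does NOT check that \x is followed by hex digits.
STRING_TOKEN = re.compile(r'"(?:[^"\\\n]|\\.)*"')

# The only escape form assumed to be legal in this sketch: \xAB
VALID_HEX_ESCAPE = re.compile(r'\\x[0-9A-Fa-f]{2}')


@dataclass
class StringError:
    offset: int   # index of the offending backslash within the token
    message: str


def lex_string(source, pos=0):
    """Lexing phase: grab the string token, regardless of escape validity."""
    m = STRING_TOKEN.match(source, pos)
    return m.group(0) if m else None


def validate_string(token):
    """Later phase (e.g. called from a listener or visitor): check escapes."""
    errors = []
    body = token[1:-1]  # strip the surrounding quotes
    i = 0
    while i < len(body):
        if body[i] != '\\':
            i += 1
        elif VALID_HEX_ESCAPE.match(body, i):
            i += 4  # skip the whole \xAB sequence
        else:
            bad = body[i:i + 4]
            errors.append(StringError(
                offset=i + 1,  # +1 accounts for the opening quote
                message=f"invalid escape sequence {bad!r}: "
                        "expected a backslash, 'x', and two hex digits",
            ))
            i += 2  # skip the backslash and the character after it
    return errors


def decode_string(token):
    """Decode a token that has already passed validate_string."""
    body = token[1:-1]
    return VALID_HEX_ESCAPE.sub(lambda m: chr(int(m.group(0)[2:], 16)), body)
```

With this arrangement, `"hello \x7z"` still lexes as a single string token, and `validate_string` returns one error pointing at the backslash, which is an easier place to produce a readable message than inside the lexer rule itself.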
