I am trying to build a basic LaTeX parser using the pest library. For the moment, I only care about lines, bold formatting, and plain text. I am struggling with the last of these. To simplify the problem, I assume that plain text cannot contain these two characters: \ and }.
lines = { line ~ (NEWLINE ~ line)* }
line = { token* }
token = { text_bold | text_plain }
text_bold = { "\\textbf{" ~ text_plain ~ "}" }
text_plain = ${ inner ~ ("\\" | "}" | NEWLINE) }
inner = @{ char* }
char = {
!("\\" | "}" | NEWLINE) ~ ANY
}
main = {
SOI ~
lines ~
EOI
}
Using this webapp, we can see that my grammar eats the char after the plain text.
Input:
Before \textbf{middle} after.
New line
Output:
- lines > line
- token > text_plain > inner: "Before "
- token > text_plain > inner: "textbf{middle"
- token > text_plain > inner: " after."
- token > text_plain > inner: "New line"
If I replace ${ inner ~ ("\\" | "}" | NEWLINE) } by ${ inner }, it fails. If I add the & in front of the suffix, it does not work either.
How can I change my grammar so that lines and bold tags are detected?
CodePudding user response:
The rule
text_plain = ${ inner ~ ("\\" | "}" | NEWLINE) }
certainly matches the character following inner (which must be a backslash, close brace, or newline). That's not what you want: you want the following character to be part of the next token. But it seems reasonable to me to ask what happened to that character, since the token corresponding to text_plain clearly doesn't show it.
The answer, apparently, is a subtlety in how tokens are formed. According to the Pest book:
When the rule starts being parsed, the starting part of the token is being produced, with the ending part being produced when the rule finishes parsing.
The key here, it turns out, is what is not being said. ("\\" | "}" | NEWLINE) is not a rule, and therefore it does not trigger any token pairs. So when you iterate over the tokens inside text_plain, you only see the token generated by inner.
None of that is really relevant, since text_plain should not attempt to match the following character in any event. I suppose you realised that, because you say you tried to change the rule to text_plain = ${ inner }, but that "failed". It would have been useful to know what "failure" meant here, but I suppose that Pest complained about the attempt to use a repetition operator on a rule which can match the empty string.
Since inner is a *-repetition, it can match the empty string. Defining text_plain as a copy of inner means that text_plain can also match the empty string. That in turn means that token ({ text_bold | text_plain }) can match the empty string, and that makes token* illegal, because Pest doesn't allow applying repetition operators to a nullable rule. The simplest solution is to change inner from char* to char+, which forces it to match at least one character.
In the following, I actually got rid of inner altogether, since it seems redundant:
main = { SOI ~ lines ~ EOI }
lines = { line ~ (NEWLINE ~ line)* ~ NEWLINE? }
line = { token* }
token = { text_bold | text_plain }
text_bold = { "\\textbf{" ~ text_plain ~ "}" }
text_plain = @{ char+ }
char = {
!("\\" | "}" | NEWLINE) ~ ANY
}
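To make the intended behaviour concrete, here is a rough hand-written sketch in plain Rust of what the grammar above is meant to match. It does not use pest at all, and the names Token, is_delimiter, take_plain, tokenize_line and tokenize are made up for this illustration. It assumes lines are separated by "\n" or "\r\n". The point it shows is that plain text stops in front of a backslash or close brace without consuming it, and that a plain-text run must contain at least one character.

```rust
/// Hypothetical token type mirroring the `text_plain` and `text_bold` rules.
#[derive(Debug, PartialEq)]
enum Token {
    Plain(String),
    Bold(String),
}

/// Characters the `char` rule refuses to match: backslash, close brace, newline.
fn is_delimiter(c: char) -> bool {
    c == '\\' || c == '}' || c == '\n'
}

/// Take one or more non-delimiter characters (the `char+` repetition).
/// The delimiter itself is left in the unconsumed rest.
/// Returns None when not even one character matches.
fn take_plain(input: &str) -> Option<(&str, &str)> {
    let end = input.find(is_delimiter).unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some((&input[..end], &input[end..]))
    }
}

/// Tokenize one line; returns None where the grammar would reject the input.
fn tokenize_line(mut rest: &str) -> Option<Vec<Token>> {
    const OPEN: &str = "\\textbf{";
    let mut tokens = Vec::new();
    while !rest.is_empty() {
        if let Some(after_open) = rest.strip_prefix(OPEN) {
            // text_bold: opening tag, plain text, closing brace.
            let (body, after_body) = take_plain(after_open)?;
            rest = after_body.strip_prefix('}')?;
            tokens.push(Token::Bold(body.to_string()));
        } else {
            // text_plain: at least one ordinary character.
            let (text, after_text) = take_plain(rest)?;
            tokens.push(Token::Plain(text.to_string()));
            rest = after_text;
        }
    }
    Some(tokens)
}

/// Split the input into lines and tokenize each, like the `lines` rule.
fn tokenize(input: &str) -> Option<Vec<Vec<Token>>> {
    input.lines().map(tokenize_line).collect()
}
```

Calling tokenize on the two-line input from the question should give one inner Vec per line, with the bold part as its own token and no characters lost between tokens. A stray } or an unknown command such as \textit makes it return None, which corresponds to the grammar failing to parse.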