Antlr4 Match anything (including multiple lines) between tokens

Time:05-05

I want to parse markdown code blocks but I can't seem to get the rule right so it matches multiple lines correctly.

Here is my grammar (code.g4):

grammar code;

file: code+ ;
code: '```' CODE '```';

CODE: [a-z]+ ;
EOL: '\r'? '\n' -> skip;

And here is my input (code.txt):

```
foo
foo
```

```
bar
bar
```

```
baz
baz
```

When I run java org.antlr.v4.gui.TestRig code file -tree code.txt, I get:

line 3:0 extraneous input 'foo' expecting '```'
line 8:0 extraneous input 'bar' expecting '```'
line 13:0 extraneous input 'baz' expecting '```'
(file (code ``` foo foo ```) (code ``` bar bar ```) (code ``` baz baz ```)

I want it to match the whole code block as one token so I can parse it as one stream of bytes. What am I missing in my grammar?

(I'm using ANTLR 4.10.1 and openjdk version "11.0.15" 2022-04-19.)

CodePudding user response:

You've defined just a single CODE token between the back-ticks. You need one or more CODE tokens:

code: '```' CODE+ '```';

That said, parsing Markdown with a tool like ANTLR (where there is a strict separation between lexer and parser rules) is going to be really hard. See: https://github.com/antlr/grammars-v4/issues/472

CodePudding user response:

Your CODE Lexer rule only matches lowercase ‘a’-‘z’ characters. It’s not going to match any other characters or whitespace (including new lines and carriage returns).

That said, just correcting that Lexer rule is not going to solve your problem. You’ll need to look into Lexer Modes: when you encounter ``` in the outer mode, you switch to the inner mode and match anything (non-greedily, `.*?`) up to the closing ```, at which point you pop your lexer mode.

(I’m pretty sure the ```’s need to be at the start of a line, so you’ll also need a predicate to only match them if they’re at the start of a line.)
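
A minimal sketch of the lexer-mode idea, under a few assumptions. Modes are only allowed in a standalone lexer grammar, so the grammar is split into two files. The names CodeLexer, CodeParser, TICKS, FENCE_OPEN, FENCE_CLOSE and IN_CODE are made up for illustration. Instead of `.*?` it uses an explicit "anything short of three backticks" rule, so the body arrives as its own CODE token without the closing fence attached. It does not check that a fence sits at the start of a line, which would need a predicate, and it assumes nothing but whitespace appears between blocks.

```antlr
// ---- CodeLexer.g4 ----
lexer grammar CodeLexer;

// Three consecutive backticks, spelled as separate one-character literals.
fragment TICKS : '`' '`' '`' ;

// Outer (default) mode: an opening fence switches to IN_CODE.
FENCE_OPEN : TICKS -> pushMode(IN_CODE) ;
WS         : [ \t\r\n]+ -> skip ;

mode IN_CODE;

// Closing fence: return to the outer mode.
FENCE_CLOSE : TICKS -> popMode ;

// Everything up to the closing fence as one token, newlines included.
// One or two backticks in a row are allowed inside the body.
CODE : ( ~[`] | '`' ~[`] | '`' '`' ~[`] )+ ;

// ---- CodeParser.g4 ----
parser grammar CodeParser;
options { tokenVocab = CodeLexer; }

file : code+ EOF ;
code : FENCE_OPEN CODE? FENCE_CLOSE ;   // CODE? permits an empty block
```

With both grammars generated and compiled, TestRig should pick up the pair from the shared prefix, e.g. java org.antlr.v4.gui.TestRig Code file -tree code.txt. Each CODE token then holds the raw block text, including the newline right after the opening fence.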
