Making a shell grammar for a recursive descent parser-CodePudding

For a school project, we're assign to write a a mini-shell: a small shell with limited functionality, written in C.

I've been seduced by the apparent simplicity of a recursive descent parser, and decided to go write by hand such parser.

The shell should handle (among built-in etc):

Basic redirection <, > with here-doc <<.
Pipes |
Logical OR || and logical AND && with parenthesis ( ) (we're not assign to behave like bash and do subshell on parenthesis, they're used like in mathematics for priority).

I've wrote and implemented this grammar, only for redirections and pipes, that work without any problems (in BNF notation):

<pipeline>          ::=     <simple_command>    '|' <pipeline>
                      |     <simple_command>    'ε'

<simple_command>    ::=     <io_list>           <word><cmd_suffix>
                      |     <io_list>           <word>
                      |     <io_list>
                      |     <word>              <cmd_suffix>
                      |     <word>

<io_list>           ::=     <io_redirect>       <io_list>
                      |     <io_redirect>

<cmd_suffix>        ::=     <io_list>           <cmd_suffix>
                      |     <io_list>
                      |     <word>              <cmd_suffix>
                      |     <word>

<io_redirect>       ::=     '<'             <word>
                      |     '>'             <word>
                      |     '>>'            <word>
                      |     '<<'            <word>

I then tried to use this grammar for && and || with ( ) :

<complete_command>  ::=     <and_or>        'ε'

<and_or>            ::=     <pipeline>      '&&'        <and_or>
                      |     <pipeline>      '||'        <and_or>
                      |     <pipeline>

<command>           ::=     <simple_command>
                      |     '('     <and_or>    ')'

<pipeline>          ::=     <command>           '|'         <pipeline>
                      |     <command>

<simple_command>    ::=     <io_list>       <word>          <cmd_suffix>
                      |     <io_list>       <word>
                      |     <io_list>
                      |     <word>          <cmd_suffix>
                      |     <word>

<io_list>           ::=     <io_redirect>       <io_list>
                      |     <io_redirect>

<cmd_suffix>        ::=     <io_list>           <cmd_suffix>
                      |     <io_list>
                      |     <word>              <cmd_suffix>
                      |     <word>

<io_redirect>       ::=     '<'             <word>
                      |     '>'             <word>
                      |     '>>'            <word>
                      |     '<<'            <word>

But when testing, I just realize that:

This grammar leads to a lot of recursion.
This grammar doesn't provide left associativity for the && and || operator.

Do you have any idea to write a grammar that handle left associativity? This seems a flaw of the recursive descent parser.. (or just me that runs out of ideas).

CodePudding user response：

A BNF grammar for left-associative operators needs to be left-recursive, which would seem to eliminate recursive descent parsing. Obviously that's not the case, since many expression parsers are written in some variation of recursive descent style (although it must be said that bottom-up parsing is a more natural match for an expression grammar).

The key is that recursive descent parsers can deal with grammar specifications which are not limited to simple BNF. In particular, an RD parser can trivially handle greedy repetition without recursion by translating it into a while loop; that effectively gives you a list of terms instead of a right-leaning pairwise tree, and the list can be reduced left-to-right producing a left-associative parse.

You might end up with a grammar which looks something like this. Here, { X } means "any number (including 0) of X" and ( X | Y ) means "either X or Y".

<complete_cmd> ::=  <and_or> <newline>
<and_or>       ::=  <pipeline> { ('&&' | '||') <pipeline> }        
<pipeline>     ::=  <command> { '|' <command> }
<command>      ::=  <simple_cmd> | '(' <and_or> ')'
<simple_cmd>   ::=  { <redirect> } <word> { ( <redirect> | <word> ) }
<redirect>     ::=  ( '<' | '>' | '<<' | '>>' ) <word>

The { X } construct is implemented with a while loop:

AST* parse_and_or(Scanner* scanner) {
  AST* result = parse_pipeline(scanner);
  while (1) {
    switch(token_peek(scanner)) {
      default:
        return result;
      case T_AND_IF:
      case T_OR_IF: {
        enum Token token = token_get(scanner);
        AST* pipeline = parse_pipeline(scanner);
        result = make_andor(token, result, pipeline);
      }
    }
  }
}

If you want to use operator precedence, it can be implemented in a recursive descent parser using "precedence climbing", in which the recursive call includes a precedence threshold and returns when an operator is found which doesn't satisfy that threshold. See, for example, the pseudocode in the Wikipedia article on operator precedence. But for simple expression grammars, it might be easier to just write the recursive rules out, one precedence level at a time.