How does CPython handle multiline input in the REPL?


Python's REPL reads input line by line. However, function definitions consist of multiple lines.

For example:

>>> def answer():
...   return 42
...
>>> answer()
42

How does CPython's parser request additional input after the partial def answer(): line?

CodePudding user response:

Python's REPL reads input line by line.

That statement is technically correct, but it's somewhat misleading. I suppose you got it from some Python "tutorial"; please be aware that it is, at best, an oversimplification, and that it is quite possible that you will encounter other oversimplifications in the tutorial.

The Python REPL does read input line by line, in order to avoid reading too much. This differs from the way Python reads files; these are read in larger blocks, for efficiency. If the REPL did that, then the following wouldn't work:

>>> print(f"*******   {input()}   *******")
Hello, world
*******   Hello, world   *******

because the line intended as input to the expression would have already been consumed before the expression was evaluated. (And, of course, the whole point of the REPL is that you immediately see the result of executing a statement, rather than having to wait for the entire input to be read.)

So the REPL only reads lines as needed, and it does read whole lines. But that doesn't mean that it executes line-by-line. It reads an entire command, then compiles the command, and then executes it, printing the result.
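You can watch this read-a-whole-command behaviour from Python itself: the standard-library codeop module (which code.InteractiveConsole builds on) exposes the completeness check a REPL relies on. A minimal sketch:

```python
import codeop

# compile_command() returns None while the command is still incomplete,
# and a code object once a whole command has been read.
assert codeop.compile_command("def answer():") is None   # more input needed
cmd = codeop.compile_command("x = 3; print(x)")          # a complete command
assert cmd is not None
```

Only after compile_command returns a code object does a REPL built on it execute anything.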

That doesn't answer the question as to how the REPL knows that it has reached the end of a command, though. To answer that, we have to start with the Python grammar, conveniently reproduced in the Python documentation.

The first five lines of that grammar are the five different top-level targets of the parser. The first two, file and interactive, are the top-level targets used for reading files and for use in an interactive session. (The others are used in different parsing contexts, and I'm not going to consider them here.)
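These two targets are reachable from Python via the mode argument to compile() and ast.parse(): "exec" corresponds to the file target and "single" to the interactive one. A quick check using only the standard library:

```python
import ast

# "exec" parses with the file target, "single" with the interactive
# target; the AST root node differs accordingly.
assert isinstance(ast.parse("x = 1\n", mode="exec"), ast.Module)
assert isinstance(ast.parse("x = 1\n", mode="single"), ast.Interactive)
```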

file and interactive are very different grammars. The file target is intended to parse an entire file, consisting of an optional list of statements ([statements]) followed by an end-of-file marker (ENDMARKER). In contrast, the interactive target reads a single statement_newline, whose definition is a few lines later in the grammar:

statement_newline:
    | compound_stmt NEWLINE 
    | simple_stmts
    | NEWLINE 
    | ENDMARKER 

Here, simple_stmts is a single line consisting of a sequence of ;-separated simple statements, followed by a NEWLINE:

>>> a = 3; print(a)
3

The important aspect of the definition of statement_newline is that every option either ends with (or is) a NEWLINE, or is the end of the file itself.

None of the above has anything to do with actually reading input, because the Python parser, like most language parsers, is not responsible for handling input. As is usual, the parser takes as input a sequence of tokens, which it requests one at a time as needed. In the grammar, tokens are represented either with CAPITAL_LETTERS (NEWLINE) or as quoted literals ('if' or ':'), which represent themselves.

These tokens come from the lexical analyser (the "lexer" in common usage), which is responsible for acquiring input as necessary and turning it into a token stream by:

  • recognising classes of tokens with the same syntactic usage (like NUMBER and NAME, whose precise characters are not important to the parser, although they will obviously be needed later on in the process).
  • recognising individual keyword tokens (the quoted literals in the grammar), which includes operator tokens. (It might sound odd to call an operator a keyword, but from the viewpoint of the lexer, that's what it is: a particular sequence of characters which makes up a unique token.)
  • fabricating other tokens as needed. In Python, these have to do with the way leading whitespace is handled; the generated tokens are NEWLINE, INDENT and DEDENT.
  • ignoring comments and irrelevant whitespace.
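The standard tokenize module lets you watch this token stream being produced (it is a pure-Python reimplementation of the lexer, tokenizing in file mode):

```python
import io
import tokenize

src = "def answer():\n    return 42\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
# Alongside NAME, OP and NUMBER, the lexer fabricates NEWLINE,
# INDENT and DEDENT tokens from the line structure.
assert "INDENT" in names and "DEDENT" in names and "NEWLINE" in names
```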

The NEWLINE token represents a newline character (or, as it happens, the two-byte sequence \r\n sometimes used as a newline marker, for example by Windows or in many internet protocols). But not every newline character is turned into a NEWLINE token. Newlines which occur inside triple-quoted strings are considered ordinary characters. A newline immediately following a \ indicates that the next physical line is logically a continuation of the current input line. Newline characters inside parenthetic syntaxes ((...), [...], and {...}) are considered ignorable whitespace. And finally, in one of the few places where the lexer distinguishes between file and interactive input, the newline at the end of a line containing only whitespace and possibly a comment is ignored, unless the input is interactive and the line is completely empty.
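tokenize makes this distinction visible: newlines that count as ignorable whitespace come out as NL tokens rather than NEWLINE. (This sketch uses file-mode tokenization; as noted, the interactive lexer treats blank lines differently.)

```python
import io
import tokenize

def token_names(src):
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(io.StringIO(src).readline)]

# The newline inside the brackets is an ignorable NL, not a NEWLINE...
assert "NL" in token_names("x = [1,\n2]\n")
# ...and in file mode, a blank line is also just an NL token.
assert token_names("x = 1\n\n").count("NEWLINE") == 1
```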

The distinction in the last rule is required in order to implement the REPL rule that an empty line terminates a multi-line compound statement, which is not the case in file input. In file input, a compound statement terminates when another statement is encountered at the same indent level, but that rule isn't suitable for interactive input, because it would require reading the first line of the next statement.
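code.InteractiveConsole, the pure-Python REPL in the standard library, demonstrates the blank-line rule directly: push() returns a true value while more input is required.

```python
import code

console = code.InteractiveConsole()
assert console.push("def answer():")      # compound statement: more needed
assert console.push("    return 42")      # still inside the block
assert not console.push("")               # blank line terminates it
assert console.locals["answer"]() == 42   # the def was compiled and executed
```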

The fact that bracketed newlines are considered ignorable whitespace requires the lexer to duplicate a small amount of the work of the parser. In particular, the lexer maintains its own stack of open parentheses/braces/brackets, which lets it track the tokens ()[]{}. Newline characters encountered in the input stream are ignored unless the bracket stack is empty. The slight duplication of effort is annoying, but sometimes such deviations from perfection are necessary.
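A toy version of that bracket stack (a hypothetical sketch which ignores strings, comments and backslash continuations, all of which the real lexer handles) might look like:

```python
# Hypothetical sketch: decide whether the newline ending `line` should
# become a NEWLINE token. The real lexer works character by character
# and also accounts for strings, comments and line continuations.
def newline_is_significant(line: str) -> bool:
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in line:
        if ch in "([{":
            stack.append(ch)
        elif ch in closers and stack and stack[-1] == closers[ch]:
            stack.pop()
    return not stack  # NEWLINE only when no bracket is open

assert not newline_is_significant("x = [1, 2,")  # inside brackets: ignored
assert newline_is_significant("x = [1, 2]")      # balanced: real NEWLINE
```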

If you're interested in the way that INDENT and DEDENT are constructed, you can read about it in the reference manual; it's interesting, but not relevant here. (NEWLINE handling is also described in the reference manual section on Lexical Analysis, but I summarised it above because it is relevant to this question.)

So, to get back to the original question: How does the REPL know that it has read a complete command? The answer is simple: it asks the parser to recognise a single statement_newline target. As noted above, that construct is terminated by a NEWLINE token, and when the NEWLINE token which terminates the statement_newline target is encountered, the parser returns the resulting AST to the REPL, which proceeds to compile and execute it.

Not all NEWLINEs match the end of statement_newline, as you can see with a careful reading of the grammar. In particular, NEWLINEs inside compound statements are part of the compound statement syntax. The grammar for compound statements does not allow two consecutive NEWLINEs, but that can never happen when reading from a file because the lexical analyser does not produce a NEWLINE token for a blank line, as noted above. In interactive input, though, the lexical analyser does produce a NEWLINE token for a blank line, so it is possible for the parser to receive two consecutive NEWLINEs. Since the compound statement syntax doesn't include the second one, it becomes part of the statement_newline syntax, thereby terminating the parser's target.

CodePudding user response:

It depends on the code you enter into the console. In this case, when Python sees the keyword def introducing a function definition, it starts a process that detects the end of the function body by looking at its indentation.

def a():
  if 1==1:
    if not 1==1:
      pass
    else:
      return "End of execution"
#End of function

As you can see, the indentation of a function, or any similar structure, is fundamental when writing it across multiple lines in the Python console. Here Python reads line by line until it detects the end of the function's indented block, and then continues reading instructions outside a().
