What is the correct terminology for the process of reading a simple config and creating data structu

Time:12-16

Reading is defined as interpreting and understanding written material.

Parsing is defined as analyzing the relationships between words in written material and grouping those words according to an underlying grammar.

My question is: what is the correct terminology for the process of reading a simple config and creating data structures corresponding to the data we acquired? Is it parsing or reading?

I have read about this distinction in many places, but I wasn't able to arrive at a conclusion by myself. In many places these terms are used interchangeably. For example:

1. Reading a Lisp object means parsing a Lisp expression in textual form and producing a corresponding Lisp object.
From <http://zvon.org/other/elisp/Output/SEC256.html> 


2. In the programming language Lisp, the reader or read function is the parser which converts the textual form of Lisp objects to the corresponding internal object structure.
From <https://en.wikipedia.org/wiki/Lisp_reader> 

So I have a muddled understanding of parsing (grouping tokens into syntactic elements) versus creating data structures from the elements we get from parsing (interpreting). I would much appreciate a clear explanation of these topics, and recommendations of sources where I can do further research on them.

CodePudding user response:

The thing is that Lisp is peculiar in that regard, and is probably not a good starting point, precisely because most languages aren't Lisp.

Usually, in a typical compiler, things work (roughly ...) as follows:

  • Your compiler gets a file which is, for all intents and purposes, just a bunch of characters at this point.

  • Now, there is a first phase called lexical analysis (or tokenization), which "breaks" those characters into pieces with some meaning attached. For example, upon reading int x = 13, it will produce four tokens, something like [("int", TYPE_KEYWORD), ("x", VAR_IDENTIFIER), ("=", ASSIGN_SYMBOL), ("13", NUMBER)]. At this point, no real checks occur as long as the input is not complete garbage: the lexer would typically be happy with x x 13 = = x as input. It might, however, reject abc"def if quotes (") are not allowed inside a variable name.

  • Then, and only then, does the compiler perform what is usually referred to as parsing: the tokens produced in the lexing phase are matched against a grammar, to see if "things have a normal shape". So = x = 34 int will be rejected, but int abcd = "twelve"; abcd["hello" + 25.76] = 5; will not.

  • The parser (which performs the previous phase) will typically produce a parse tree, saying roughly what the different elements are (e.g. a function definition with some arguments, an assignment to some variable, a while-loop, etc.). This information is internal to the compiler: the tree exists only during compilation and is not part of the compiled code. In particular, you can have a compiler written in a language A which compiles a language B, and the parse tree would be a data structure of language A.

  • There are more phases in compilation, but for our purposes, this is it. Of course, much more needs to be done (analysis to check, e.g., that the program is type-correct and that every function being called actually has a definition, and finally producing a compiled version of the code), but parsing is over at this point.
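
To make the lexing/parsing split concrete, here is a minimal Python sketch of the first two phases. It is only an illustration, not how any real compiler is written: the token names are borrowed from the example above, and the one-rule "grammar" (declaration := TYPE_KEYWORD VAR_IDENTIFIER ASSIGN_SYMBOL NUMBER) and the function names tokenize and parse_declaration are made up for this post.

```python
import re

# Hypothetical token categories, named after the ones in the example above.
# Order matters: the keyword pattern is tried before the identifier pattern.
TOKEN_SPEC = [
    ("TYPE_KEYWORD", r"int\b"),
    ("NUMBER", r"\d+"),
    ("VAR_IDENTIFIER", r"[A-Za-z_]\w*"),
    ("ASSIGN_SYMBOL", r"="),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Lexical analysis: split raw characters into (text, kind) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            # The lexer only complains about characters it cannot classify.
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.group(), match.lastgroup))
        pos = match.end()
    return tokens


def parse_declaration(tokens):
    """Parsing: match the tokens against the toy rule
    declaration := TYPE_KEYWORD VAR_IDENTIFIER ASSIGN_SYMBOL NUMBER
    and build a small parse tree (a plain dict here)."""
    expected = ["TYPE_KEYWORD", "VAR_IDENTIFIER", "ASSIGN_SYMBOL", "NUMBER"]
    kinds = [kind for _, kind in tokens]
    if kinds != expected:
        raise SyntaxError(f"expected {expected}, got {kinds}")
    return {
        "node": "declaration",
        "type": tokens[0][0],
        "name": tokens[1][0],
        "value": int(tokens[3][0]),
    }


print(tokenize("int x = 13"))
print(parse_declaration(tokenize("int x = 13")))
print(tokenize("x x 13 = = x"))  # the lexer is happy with this
```

With this sketch, tokenize('abc"def') raises ValueError (the lexer rejects the quote), while parse_declaration(tokenize("x x 13 = = x")) raises SyntaxError: the tokens are fine individually, but they do not have the shape the grammar asks for.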

You can see an example of the kind of grammar I mentioned above in Python's grammar for function definitions, where a "valid" function definition has to match some "shape" defined by the grammar, which is itself defined in terms of tokens (roughly, groups of characters).
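
If you want to poke at this yourself, Python exposes its own parser through the standard ast module. The snippet below is just a quick experiment, and the helper name is_valid_python is made up here. It shows that a well-shaped function definition becomes a FunctionDef node, that a badly shaped one is rejected, and that the parser does not care whether the code makes sense type-wise.

```python
import ast

# A function definition that matches the grammar becomes a FunctionDef node.
tree = ast.parse("def add(a, b):\n    return a + b\n")
func = tree.body[0]
print(type(func).__name__, func.name, [arg.arg for arg in func.args.args])
# FunctionDef add ['a', 'b']


def is_valid_python(source):
    """Return True if the source matches Python's grammar (hypothetical helper)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Missing colon: the "shape" is wrong, so the parser rejects it.
print(is_valid_python("def add(a, b) return a + b"))  # False

# Nonsense type-wise, but the shape is fine, so the parser accepts it.
print(is_valid_python('abcd = "twelve"; abcd["hello" + 25.76] = 5'))  # True
```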

The thing with Lisp is the following:

  • Lisp code is, more or less like any other language, written in files.
  • However, what happens in (Common) Lisp is that this file is "read" (as if) by a Common Lisp function called read. This function reads characters, and returns a Lisp object (typically, a list, with symbols, numbers and nested lists, etc). That is, if your file contains the characters (list 10 "abcd") (which is 16 characters), read will return the Lisp list (list 10 "abcd"), a list of length three containing a symbol, an integer and a string.
  • Now, that Lisp object is the thing being evaluated (and compiled, if needed). Said differently, the grammar and therefore the semantics of the language are defined in terms of Lisp objects, not in terms of characters/tokens.
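
To give an idea of what such a reader does, here is a toy version sketched in Python. It is nowhere near the real Common Lisp read (no reader macros, no string escapes, no floats, no quote syntax). The Symbol class and the use of a Python list to stand in for a Lisp list are simplifications chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Stand-in for a Lisp symbol, so that list and "list" stay distinct."""
    name: str


def read(text):
    """Read one object from the characters in text and return it."""
    obj, _ = _read_from(text, 0)
    return obj


def _skip_whitespace(text, pos):
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def _read_from(text, pos):
    """Return (object, position just after the object)."""
    pos = _skip_whitespace(text, pos)
    if pos >= len(text):
        raise EOFError("no object to read")
    ch = text[pos]
    if ch == "(":
        items, pos = [], pos + 1
        while True:
            pos = _skip_whitespace(text, pos)
            if pos >= len(text):
                raise EOFError("missing closing parenthesis")
            if text[pos] == ")":
                return items, pos + 1
            item, pos = _read_from(text, pos)
            items.append(item)
    if ch == ")":
        raise SyntaxError("unexpected ')'")
    if ch == '"':
        # Simplification: no escape sequences inside strings.
        end = text.find('"', pos + 1)
        if end == -1:
            raise EOFError("unterminated string")
        return text[pos + 1:end], end + 1
    # Anything else is an atom: an integer if it looks like one, else a symbol.
    end = pos
    while end < len(text) and not text[end].isspace() and text[end] not in '()"':
        end += 1
    atom = text[pos:end]
    try:
        return int(atom), end
    except ValueError:
        return Symbol(atom), end


obj = read('(list 10 "abcd")')
print(obj)       # [Symbol(name='list'), 10, 'abcd']
print(len(obj))  # 3
```

Note that read stops as soon as it has an object: it does not check whether list is a known function or whether the arguments make sense. That is left to whatever consumes the object afterwards.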

You can see what I mean if you check Common Lisp's reference for function definition: no characters are being referenced, only symbols, lists and other Lisp objects.

Of course, a Lisp compiler still has work to do: determining which symbols correspond to which bindings, checking that the grammar is actually respected, dealing with memory and whatnot. But the reading/parsing phase is fundamentally distinct. In particular, the sentence

"Reading a Lisp object means parsing a Lisp expression in textual form and producing a corresponding Lisp object."

has no equivalent in other languages. There is no "corresponding Python object" to the bunch of characters foo = bar[42]. There is one, on the other hand, for Lisp's characters (setf foo 42) -- a list of length 3, containing two symbols and a number.

CodePudding user response:

There is no single “correct terminology” without context. Words are used with different meanings in different contexts.

Often when discussing computing, “reading” means retrieving data, and “parsing” means analyzing a string of characters or other symbols relative to some rules for their grammar and interpretation. However, these meanings should not be taken as absolute; the words can be used in other ways.
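
Applied to the config case from the question, the split might look like the sketch below. The "key = value" format, the "#" comment rule, and the names parse_config and load_config are all assumptions made up for this example, since the question does not say what the config looks like.

```python
def parse_config(text):
    """Parsing: analyze the characters against the (assumed) rules of the
    format and build a data structure, here a dict of strings."""
    config = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no data
        key, separator, value = line.partition("=")
        if not separator or not key.strip():
            raise ValueError(f"line {line_number}: expected 'key = value', got {line!r}")
        config[key.strip()] = value.strip()
    return config


def load_config(path):
    # "Reading" in the narrow sense: retrieve the characters from storage.
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    # "Parsing": turn those characters into a data structure.
    return parse_config(text)


print(parse_config("# server settings\nhost = example.org\nport = 8080\n"))
# {'host': 'example.org', 'port': '8080'}
```

In everyday speech, people would still say that load_config "reads the config", which is exactly the looser usage mentioned above.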
