What is required to store these values in conceptual order when parsing and building an AST?-CodePudding

I am working on a programming language, say we are focusing on type definitions, and say the type definitions are specified in a certain order. Say, too, we have "macros" which are lazily evaluated and we essentially inject the appropriate code where the macro is. Ignore for this question the idea of processing the macros first, then processing the rest, think of them as processed in order they are encountered. Then for example, we might have this in a file/module:

INSERT z
type a
type b
INSERT x
type c
type d
INSERT y

MACRO y {
  type p
  type q
  INSERT w
}

MACRO z {
  type t
  type v
}

MACRO x {
  type h
  type i
}

MACRO w {
  type l
  type m
  type n
}

The resulting AST should look something like this:

[
  { type: 't' },
  { type: 'v' },
  { type: 'a' },
  { type: 'b' },
  { type: 'h' },
  { type: 'i' },
  { type: 'c' },
  { type: 'd' },
  { type: 'p' },
  { type: 'q' },
  { type: 'l' },
  { type: 'm' },
  { type: 'n' },
]

// or more concisely visualized as:
t v a b h i c d p q l m n

That is, it is in the order you would conceptually expect.

The naive algorithm would do this though:

a b c d p q t v h i l m n

That is, it would do something like this to start parsing:

WAIT z
a
b
WAIT x
c
d
WAIT y

It would insert placeholders for the things it hasn't encountered yet, and come back to them to fill them in later. But it would have an array and simply push them into the array as it encountered each next one, resulting in that second ordering.

push('a')
push('b')
push('c')
push('d')
push('h')
push('i')
...

My question is, how would you "insert these in conceptual order" (as opposed to discovery order)?

At first I was thinking, maybe use a SortedSet, and have the key be the index, and use a decimal for the later encountered ones or something like that (kind of like how you might sort items by a position property in a database, as items are drag-n-drop ordered haphazardly). But I have a hard time taking that spark of a thought further.

Then I was thinking to keep a pointer to the WAIT objects, and when the macro was encountered, you would find the index of the WAIT, and insert all the items into the array starting at that array index. But that seems less than ideal, having to move all the subsequent items in the array every time you encounter the macro results...

Are there any other magic tricks than these two ways for doing it? How would you solve this problem and create a sorted list or even a sorted hash table, where the values are in the value you would conceptually expect them to be (the first example)?

Maybe a third approach would be to sort them after everything is parsed and use some sort of special index keys? I'm not sure.

CodePudding user response：

One idea is to create a nested array. When an INSERT is encountered, don't attempt to insert the macro definition, but insert a nested object (array) that initially is empty, and log that object reference in a map keyed by the macro name.

When eventually the MACRO instruction is encountered, populate that array which you find in that map. If no such entry exists yet, then create it.

If an INSERT occurs when the mentioned macro already has such array reference then use that one.

For the example this leads to a structure which evolves as follows:

After processing INSERT z:

result: [
  { macro: array1 }
]

macros: {
  z: array1  // same reference as above; array is currently empty
}

After INSERT x:

result: [
  { macro: array1 },
  { type: "a" },
  { type: "b" },
  { macro: array2 }
]

macros: {
  z: array1,
  x: array2
}

After INSERT y:

result: [
  { macro: array1 },
  { type: "a" },
  { type: "b" },
  { macro: array2 },
  { type: "c" },
  { type: "d" },
  { macro: array3 }
]

macros: {
  z: array1,
  x: array2,
  y: array3
}

During the recursive processing of macro y, the deeper result (marked with an accent) looks like this when INSERT w is processed:

result': [
  { type: "p" },
  { type: "q" },
  { macro: array4 }
]

result: [
  { macro: array1 },
  { type: "a" },
  { type: "b" },
  { macro: array2 },
  { type: "c" },
  { type: "d" },
  { macro: array3 }
]

macros: {
  z: array1,
  x: array2,
  y: array3
  w: array4
}

Once all the macro definitions have been processed, these arrayX objects get populated, and that can be seen via the macros map, and (as they are the same arrrays) via the result array. The results array will eventually look like this:

result [
  { macro: [
            { type: "t" },
            { type": "v" }
           ]
  },
  { type: "a" },
  { type: "b" },
  { macro: [
            { type: "h" },
            { type: "i" }
           ]
  },
  { type: "c" },
  { type: "d" },
  { macro: [
            { type: "p" },
            { type: "q" },
            { macro: [
                      { type: "l" },
                      { type: "m" },
                      { type: "n" }
                     ]
            }
           ]
  }
]

It is not visible in this presentation, but if a macro is inserted multiple times, the corresponding array is not duplicated. The result will then just have duplicate references to the same array.

It then remains to flatten that nested structure, but that is just a matter of a depth-first traversal.

Here is a JavaScript solution that does all this:

function parse(input) {
    function graph() {
        const result = [];
        while (true) {
            const instruction = iterator.next().value;
            if (instruction === undefined || instruction == "}") return result;
            const arg = iterator.next().value;
            switch (instruction) {
            case "type":
                result.push({type: arg});
                break;
            case "INSERT":
                macros[arg] ??= []; // Create the macro array as empty (temporarily) if it doesn't exist
                result.push({macro: macros[arg]});
                break;
            case "MACRO":
                const token = iterator.next().value;
                if (token != "{") throw "Syntax error: expected '{', got "   token;
                if (macros[arg]?.length) throw "Duplicate definition of macro "   arg;
                macros[arg] ??= []; // Only create macro array if it wasn't referenced before
                macros[arg].push(...graph()); // Recursion to populate macro array
                break;
            default:
                throw "unexpected token";
            }
        }
    }

    function* values(nodes) {
        for (let node of nodes) {
            if (node.type) yield node;
            else if (node.macro) yield* values(node.macro);
        }
    }
    
    const macros = {}; // Map of macro definitions
    const iterator = input.match(/\S /g)[Symbol.iterator]();
    const result = graph();
    return Array.from(values(result));
}

// Demo
let input = `
INSERT z
type a
type b
INSERT x
type c
type d
INSERT y

MACRO y {
  type p
  type q
  INSERT w
}

MACRO z {
  type t
  type v
}

MACRO x {
  type h
  type i
}

MACRO w {
  type l
  type m
  type n
}
`;

console.log(parse(input));

Be aware that this language can create cycles, and if you do, the above values() iterator will keep yielding values, so that the Array.from call will never finish.