Home > OS >  How can I build a tree of headings based on the DOM
How can I build a tree of headings based on the DOM

Time:08-06

I have a DOM of headings. I want to output a tree from it.

const dom = new DOMParser().parseFromString(
  `<!DOCTYPE html>
  <body>
    <h1>A</h1>
    <h2>B1</h2>
    <h2>C</h2>
    <h3>D</h3>
    <h3>E</h3>
    <h4>F</h3>
    <h2>B2</h2>
  </body>`,
  "text/html"
);

This is the desired output, a hierarchical table of content tree:

const output = {
  A: {
    tag: "H1",
    children: [
      {
        B1: { tag: "H2" },
        C: {
          tag: "H2",
          children: [
            {
              D: { tag: "H3" },
              E: { tag: "H3" },
              children: [{ F: { tag: "H4" } }],
            },
          ],
        },
        B2: { tag: "H2" },
      },
    ],
  },
};

This is my attempt:

const headerTagSet = new Set(["h1", "h2", "h3", "h4"]);
const headerTags = ["h1", "h2", "h3", "h4"];

function buildTree(nodes) {
  const tree = {};
  let currentNode = tree;
  let prevLevel = 0;
  const stack = [];

  for (const node of nodes) {
    if (node.tagName && headerTagSet.has(node.tagName.toLowerCase())) {
      const currentLevel = headerTags.findIndex(
        (tag) => tag === node.tagName.toLowerCase()
      );
      const isNested = currentLevel > prevLevel;
      if (isNested) {
        const newNode = {
          [node.textContent]: {
            tag: node.tagName,
          },
        };
        stack.push(newNode);
        if (currentNode.childre) {
          currentNode.childre.push(newNode);
        } else {
          currentNode.children = [newNode];
        }
        prevLevel = currentLevel;
        currentNode = newNode;
      } else {
        currentNode[node.textContent] = { tag: node.tagName };
        stack.length = 0;
      }
    }
  }

  return tree;
}

buildTree(Array.from(doc.body.childNodes));

However, my current solution is not correct. I can't seem to figure out what the current algorithm is.

CodePudding user response:

First of all, there seems to be an error in the structure. The expected shape for one of your nodes looks like ths:

{
  E: { tag: "H3" },
  children: [{ F: { tag: "H4" } }],
}

That is, a loose children tag conected to the root single object in C's children array, but I suspect this was intended:

{
  E: { tag: "H3", children: [{ F: { tag: "H4" } }] },
}

That is, the children property connected to E. I'm going to proceed under the latter assumption for a consistent structure.


Secondly, as described at length in the comments, the desired output structure seems suboptimal to me.

The first issue is that all children arrays have a single item always, which seems pointless. A guaranteed-one-item array might as well be stripped out and the lone element moved outward to replace the unnecessary wrapper. This saves an extra loop or [0] when it comes time to use this structure.

Secondly, when working with recursive structures, self-similarity of nodes is a desirable property which makes it easier to use the object. But the expected structure has arbitrary keys based on node text contents, which would generally be long strings in a typical DOM. Having to excessively use for ... in/Object.keys/Object.values/Object.entries or assign properties dynamically with {[key]: val} tend to be antipatterns, indicating abuse of objects as arrays.

Trees are usually used in traversals (looping over children arrays), not key lookups. Flat maps/objects are better for key lookups if that's how you ultimately aim to use the structure.

For a long time in JS' history, object key order wasn't guaranteed and there was no syntactical shorthand for dynamically assigning keys (i.e. {[key]: "val"}). Many other languages still don't provide these "features" for objects because they're easy to abuse to create suboptimal structures, as I'd argue is the case here.

Furthermore, with the desired structure, two sibling headers with the same text content will overwrite each other, resulting in potentially lost data and confusing bugs.

Another red flag is when the code to create the structure is fussy and difficult to write. That suggests the code to consume the structure will also be fussy and difficult to write. The point of programming is to make things as easy as possible, so I would choose the easier structure until forced to use something harder. And even then, if a third-party API requires it, I'd generally keep using the good structure and make the transformation to (and from) the uglier one at the last moment as a wrapper on the API.

To remedy these issues, consider the following array-based, self-similar structure. Easy to build and use.

const treeifyHeaders = dom => {
  const stack = [{tag: "H0", children: []}];
  const headers = "h1, h2, h3, h4, h5, h6";

  for (const header of dom.querySelectorAll(headers)) {
    const {tagName: tag, textContent: text} = header;
    const node = {tag, text};
    let last = stack.at(-1);

    while (last.tag >= node.tag) {
      stack.pop();
      last = stack.at(-1);
    }
    
    last.children = last.children || [];
    last.children.push(node);
    stack.push(node);
  }

  return stack[0].children;
};

const dom = new DOMParser().parseFromString(
  `<!DOCTYPE html>
  <body>
    <h1>A1</h1>
    <h2>B1</h2>
    <h2>B2</h2>
    <h2>B3</h2>
    <h2>B4</h2>
    <h3>C1</h3>
    <h3>C2</h3>
    <h4>D1</h4>
    <h6>F1</h6>
    <h4>D2</h4>
    <h2>B5</h2>
    <h3>C3</h3>
    <h1>A2</h1>
    <h2>B6</h2>
  </body>`,
  "text/html"
);

console.log(treeifyHeaders(dom));
.as-console-wrapper{max-height:100%!important}

Note that the output when a header skips nesting during a pop might be handled differently than expected. For example, <h1></h1><h3></h3><h2></h2> will set both <h2> and <h3> as direct children of <h1>--probably the most reasonable result given the circumstances. Since such a case isn't in your example, I assume it doesn't need to be handled yet.

Now, if you're certain that you "accept the terms and conditions" and you want to (or have to) use the original structure, here's the code to create it. Writing the code directly was annoying, so I just transformed the easier-to-use structure into the harder-to-use-one.

If you have huge trees, the weird tree can be produced directly without the nice tree step, but until the motivation exists, I'd avoid it as premature optimization.

const makeWeirdTree = roots => {
  const o = {};

  for (const child of roots) {
    o[child.text] = {tag: child.tag};

    if (child.children) {

      // remove these brackets to get rid of the extra array wrapper
      //                       V                             V
      o[child.text].children = [makeWeirdTree(child.children)];
    }
  }

  return o;
};

const treeifyHeaders = dom => {
  const stack = [{tag: "H0", children: []}];
  const headers = "h1, h2, h3, h4, h5, h6";

  for (const header of dom.querySelectorAll(headers)) {
    const {tagName: tag, textContent: text} = header;
    const node = {tag, text};
    let last = stack.at(-1);

    while (last.tag >= node.tag) {
      stack.pop();
      last = stack.at(-1);
    }

    last.children = last.children || [];
    last.children.push(node);
    stack.push(node);
  }

  return stack[0].children;
};

const expected = {
  A: {
    tag: "H1",
    children: [{
      B1: {tag: "H2"},
      C: {
        tag: "H2",
        children: [{
          D: {tag: "H3"},
          E: {
            tag: "H3",
            children: [{
              F: {tag: "H4"}
            }]
          },
        }],
      },
      B2: {tag: "H2"}
    }]
  }
};
const dom = new DOMParser().parseFromString(
  `<!DOCTYPE html>
  <body>
    <h1>A</h1>
    <h2>B1</h2>
    <h2>C</h2>
    <h3>D</h3>
    <h3>E</h3>
    <h4>F</h3>
    <h2>B2</h2>
  </body>`,
  "text/html"
);
console.log(makeWeirdTree(treeifyHeaders(dom)));
console.log(_.isEqual(expected, makeWeirdTree(treeifyHeaders(dom))));
.as-console-wrapper {max-height: 100%!important}
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.4/lodash.min.js"></script>

Just to drive the point home, here's what basic traversal code looks like for the two structures. The version with children arrays feels a fair bit more natural than the one with objects.

const traverseNormal = (roots, depth=0) => {
  roots.forEach(({tag, text, children}) => {
    console.log("__".repeat(depth)   `${tag} => ${text}`);
    children && traverseNormal(children, depth   1);
  });
};

const traverseWeird = (roots, depth=0) => {
  Object.entries(roots).forEach(([text, {tag, children}]) => {
    console.log("__".repeat(depth)   `${tag} => ${text}`);

    if (children && children[0]) {
      traverseWeird(children[0], depth   1);
    }
  });
};

traverseNormal(treeifyHeaders(dom));
console.log("-".repeat(30));
traverseWeird(makeWeirdTree(treeifyHeaders(dom)));

Notice how if we'd made the node keys fully identical by adding empty arrays to leaves, the consumer's code becomes even simpler, avoiding the children && condition. The tradeoff is extra object allocations, which might be prohibitively expensive, but the point stands: prefer self-similarity in designing tree nodes.

  • Related