Context
I'm building a set of 'extractor' functions whose purpose is to extract what looks like components from a page (using jsdom and nodejs). The final result should be these 'component' objects ordered by where they originally appeared in the page.
Problem
The last part of this process is a bit problematic. As far as I can see, there's no easy way to tell where a given element is in a given dom document's source code.
The numeric depth or css/xpath-like path also doesn't feel helpful in this case.
Example
With the given extractors...
const extractors = [
  // Extract buttons
  dom =>
    Array.from(dom.window.document.querySelectorAll('button'))
      .map(elem => ({
        type: 'button',
        name: elem.name,
        position: ???,
      })),
  // Extract links
  dom =>
    Array.from(dom.window.document.querySelectorAll('a'))
      .map(elem => ({
        type: 'link',
        name: elem.textContent,
        position: ???,
        link: elem.href,
      })),
];
...and the given document (I know, it's an ugly and un-semantic example..):
<html>
<body>
<a href="/">Home</a>
<button>Login</button>
<a href="/about">About</a>
...
I need something like:
[
{ type: 'button', name: 'Login', position: 45, ... },
{ type: 'link', name: 'Home', position: 20, ... },
{ type: 'link', name: 'About', position: 72, ... },
]
(which can later be ordered by item.position)
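To make the intended last step concrete, here is a minimal sketch of the merge-and-sort I have in mind once each item carries a numeric position. The helper name extractInOrder and the stub extractors are hypothetical; the stubs just return the hard-coded positions from the example above, since computing them is the open question.

```javascript
// Hypothetical helper: run every extractor and order the results by `position`.
function extractInOrder(dom, extractors) {
  return extractors
    .flatMap(extract => extract(dom))
    .sort((a, b) => a.position - b.position);
}

// Stub extractors that ignore `dom` and return the positions from the example.
const stubExtractors = [
  () => [{ type: 'button', name: 'Login', position: 45 }],
  () => [
    { type: 'link', name: 'Home', position: 20 },
    { type: 'link', name: 'About', position: 72 },
  ],
];

const ordered = extractInOrder(null, stubExtractors);
console.log(ordered.map(item => item.name)); // [ 'Home', 'Login', 'About' ]
```

Any value that increases with source order would work as position here; it does not have to be a character offset.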
CodePudding user response:
Try a TreeWalker. It visits nodes in document order, so a simple counter gives each node its position. This example is a class I wrote that uses one.
class Tree {
  constructor(document, root, whatToShow) {
    this.doc = document;
    this.root = root;
    // 1 = elements only, 3 = text nodes only, anything else = both
    this.node = whatToShow;
  }
  walker(filter, expandEntity = false) {
    const SHOW = this.node === 1 ? NodeFilter.SHOW_ELEMENT
      : this.node === 3 ? NodeFilter.SHOW_TEXT
      : NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_TEXT;
    return this.doc.createTreeWalker(this.root, SHOW, filter, expandEntity);
  }
  walk() {
    let position = 0;
    const DOM = [];
    const W = this.walker();
    // nextNode() advances in document order, so the counter mirrors source order.
    while (W.nextNode()) {
      DOM.push({
        type: W.currentNode.tagName, // undefined for text nodes
        name: W.currentNode.textContent,
        position: position++,
      });
    }
    return DOM;
  }
}
const w = new Tree(document, document.body, 1);
console.log(w.walk());
<body>
<input type="checkbox" id="nav-trigger" />
<label for="nav-trigger" >
<div ></div>
</label>
<label for="nav-trigger" ></label>
<div id="main">
<section >
<article><h1>js_utils</h1>
<p>This library is a collection of useful functions that we don't want to keep writing again and again.
It also serves the purpose to help manipulate the DOM when a framework like react.js is not needed or would be overkill.</p>
<h2>npm scripts</h2>
<pre ><code> {
"build": "node ./node_modules/webpack/bin/webpack.js --config ./webpack.config.js ./src/index.js",
"generate-docs": "node_modules/.bin/jsdoc --configure .jsdoc.json --verbose",
"test": "jest",
"test-verbose": "jest --coverage --config ./jest.config.js"
}
</code></pre>
<ul>
<li><strong>build</strong>: compile source code to generate the complete library</li>
<li><strong>generate-docs</strong>: generate documentation using jsdoc comments</li>
<li><strong>test</strong>: run jest tests</li>
<li><strong>test-verbose</strong>: run tests and generate coverage reports</li>
</ul>
<p><a href="https://payouri.github.io/js_utils/index.html">Online Docs</a> </br>
<a href="https://codepen.io/Zorlimar/pen/vQQmOo">Available on CodePen</a></p></article>
</section>
</div>
<br >
<footer>
Generated by <a href="https://github.com/jsdoc3/jsdoc">JSDoc 3.6.4</a> on Thu May 28 2020 20:51:24 GMT 0200 (GMT 02:00) using the Minami theme.
</footer>
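The idea in the class above, numbering nodes as a traversal visits them in document order, can also be sketched without a TreeWalker so that separate extractors can look up an element's position. This is only a rough sketch: indexInDocumentOrder is a hypothetical helper, and the plain objects below are stand-ins for DOM elements, used so the snippet runs without a DOM. It reads nothing but `children`, which real elements also expose.

```javascript
// Number every node in document (pre-order) order, starting at 0 for the root.
// Only `children` is read, so this works on DOM elements and on plain objects.
function indexInDocumentOrder(root) {
  const positions = new Map();
  let position = 0;
  const visit = node => {
    positions.set(node, position++);
    for (const child of Array.from(node.children || [])) visit(child);
  };
  visit(root);
  return positions;
}

// Stand-in for <body><a>Home</a><button>Login</button><a>About</a></body>
const home = { tagName: 'A', children: [] };
const login = { tagName: 'BUTTON', children: [] };
const about = { tagName: 'A', children: [] };
const body = { tagName: 'BODY', children: [home, login, about] };

const positions = indexInDocumentOrder(body);
console.log(positions.get(home), positions.get(login), positions.get(about)); // 1 2 3
```

An extractor could then set position: positions.get(elem). Note that these are ordinal indices rather than character offsets in the source, which is enough for sorting.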
CodePudding user response:
One possible rough way I can think of is something like:
function findPos(elem) {
  // Temporarily tag the element, then look for the tag in the serialized document.
  elem.setAttribute('data-pf', '1');
  try {
    return elem.ownerDocument.documentElement.outerHTML.indexOf('data-pf');
  } finally {
    elem.removeAttribute('data-pf');
  }
}
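See also: https://github.com/jsdom/jsdom#serializing-the-document-with-serialize
However, besides being imprecise (the index points at the injected attribute rather than at the start of the tag, and it refers to the re-serialized markup, not the original source), it feels like overkill and may perform badly, since the whole document is serialized for every element. Unless it's crazy slow, that's not a big problem, because this task is a one-time job.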