Sorry for the probably trivial question, but I still don't get how streams work in Node.js.
I want to parse an HTML file and get the path of the first script I encounter. I'd like to stop parsing after the first match, but the onopentag() listener keeps being invoked until the actual end of the HTML file. Why?
const { WritableStream } = require("htmlparser2/lib/WritableStream");
const scriptPath = await new Promise(function(resolve, reject) {
    try {
        const parser = new WritableStream({
            onopentag: (name, attrib) => {
                if (name === "script" && attrib.src) {
                    console.log(`script : ${attrib.src}`);
                    resolve(attrib.src); // return the first script, effectively called for each script tag
                    // none of below calls seem to work
                    indexStream.unpipe(parser);
                    parser.emit("close");
                    parser.end();
                    parser.destroy();
                }
            },
            onend() {
                resolve();
            }
        });
        const indexStream = got.stream("/index.html", {
            responseType: 'text',
            resolveBodyOnly: true
        });
        indexStream.pipe(parser); // and parse it
    } catch (e) {
        reject(e);
    }
});
Is it possible to close the parser stream before the actual end of indexStream, and if so, how? If not, why not?
Note that the code works and my promise is indeed resolved with the first match.
CodePudding user response:
There's a little confusion about how the WritableStream works. First off, when you do this:
const parser = new WritableStream(...)
the variable name is misleading. It really should be this:
const writeStream = new WritableStream(...)
The actual HTML parser is an instance variable of the WritableStream object named ._parser (see code). It's that parser that invokes the onopentag() callback, and because it works off a buffer that may already hold accumulated text, disconnecting from the read stream may not immediately stop events that are still coming from the buffered data.
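To make the buffering point concrete, here is a minimal sketch that uses only Node's built-in streams. The sink is a hypothetical stand-in for the parser stream (it is not htmlparser2 code, and collectTokens is a made-up name): it walks each chunk token by token, the same way a parser walks through the text it has already been handed. Calling unpipe() in the middle of a chunk stops future chunks from arriving, but the chunk that is currently being processed still runs to completion.

```javascript
const { Readable, Writable } = require("node:stream");

// Hypothetical stand-in for a parser stream: every chunk is split into
// space-separated tokens that are handled one after another, synchronously.
function collectTokens(chunks) {
  return new Promise((resolve) => {
    const seen = [];
    const source = Readable.from(chunks);
    const sink = new Writable({
      write(chunk, _encoding, callback) {
        for (const token of chunk.toString().split(" ")) {
          seen.push(token);
          if (seen.length === 1) {
            // Detach from the source on the very first token...
            source.unpipe(sink);
            // ...and report what was seen once this chunk has been processed.
            setImmediate(() => resolve(seen));
          }
        }
        callback();
      },
    });
    source.pipe(sink);
  });
}

collectTokens(["a b c", "d e"]).then((seen) => console.log(seen));
```

With the input above I'd expect it to log all three tokens of the first chunk and nothing from the second one: unpipe() kept new data out, but it could not interrupt the loop over data that had already been delivered.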
The parser itself has a public reset() method, and it appears that if you disconnect from the read stream and then call reset(), it should stop emitting events.
You can try this (I'm not a TypeScript person so you may have to massage some things to make the TypeScript compiler happy, but hopefully you can see the concept here):
const { WritableStream } = require("htmlparser2/lib/WritableStream");
const scriptPath = await new Promise(function(resolve, reject) {
    try {
        const writeStream = new WritableStream({
            onopentag: (name, attrib) => {
                if (name === "script" && attrib.src) {
                    console.log(`script : ${attrib.src}`);
                    resolve(attrib.src); // return the first script, effectively called for each script tag
                    // disconnect the readstream
                    indexStream.unpipe(writeStream);
                    // reset the internal parser so it clears any buffers it
                    // may still be processing
                    writeStream._parser.reset();
                }
            },
            onend() {
                resolve();
            }
        });
        const indexStream = got.stream("/index.html", {
            responseType: 'text',
            resolveBodyOnly: true
        });
        indexStream.pipe(writeStream); // and parse it
    } catch (e) {
        reject(e);
    }
});
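If reaching into the private ._parser field feels too fragile, a more defensive variation is to guard the callback with a flag, so anything that still arrives after the first match is simply ignored, and to destroy the source stream so nothing more is read. Below is a rough sketch of that idea using plain Node streams. The names findFirst and isMatch are hypothetical and the sink is again only a stand-in for the htmlparser2 stream, so treat it as an illustration of the pattern rather than a drop-in fix.

```javascript
const { Readable, Writable } = require("node:stream");

// Hypothetical helper: resolves with the first token accepted by isMatch,
// or with undefined if the source ends without any match.
function findFirst(source, isMatch) {
  return new Promise((resolve, reject) => {
    let done = false; // guard flag: set as soon as we have our answer

    const sink = new Writable({
      write(chunk, _encoding, callback) {
        for (const token of chunk.toString().split(" ")) {
          if (done) break; // ignore whatever is left in this chunk
          if (isMatch(token)) {
            done = true;
            resolve(token);
            source.unpipe(sink); // no more chunks for the sink
            source.destroy();    // stop reading from the source altogether
          }
        }
        callback();
      },
      final(callback) {
        if (!done) resolve(undefined); // source ended with no match
        callback();
      },
    });

    source.on("error", reject);
    source.pipe(sink);
  });
}

findFirst(
  Readable.from(["x y", "src=a src=b", "src=c"]),
  (token) => token.startsWith("src=")
).then((first) => console.log(first));
```

In the htmlparser2 version, the equivalent of the guard would be an early return at the top of onopentag once a script has been found, which works regardless of how much text the parser still has buffered.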