Node.js htmlparser2 WritableStream still emits events after end() call

Time:08-19

Sorry for the probably trivial question, but I still don't understand how streams work in Node.js.

I want to parse an HTML file and get the path of the first script I encounter. I'd like to stop parsing after the first match, but the onopentag() listener keeps being invoked until the actual end of the HTML file. Why?

  const got = require("got"); // assumes a CommonJS-compatible got release (v11 or earlier)
  const { WritableStream } = require("htmlparser2/lib/WritableStream");
  const scriptPath = await new Promise(function(resolve, reject) {
    try {
      const parser = new WritableStream({
        onopentag: (name, attrib) => {
          if (name === "script" && attrib.src) {
            console.log(`script : ${attrib.src}`);
            resolve(attrib.src);  // resolve with the first script; in practice this line runs for every script tag
            // none of the calls below seem to work
            indexStream.unpipe(parser);
            parser.emit("close");
            parser.end();
            parser.destroy();                
          }
        },
        onend() {
          resolve();
        }
      });
      const indexStream = got.stream("/index.html", {
        responseType: 'text',
        resolveBodyOnly: true
      });
      indexStream.pipe(parser); // and parse it
    } catch (e) {
      reject(e);
    }
  });

Is it possible to close the parser stream before the actual end of indexStream? If yes, how? If not, why not?

Note that the code does work: my promise is resolved with the first match.

Answer:

There's a little confusion about how the WritableStream works. First off, when you do this:

const parser = new WritableStream(...)

the variable name is misleading. It really should be something like this:

const writeStream = new WritableStream(...)

The actual HTML parser is an instance variable of the WritableStream object named ._parser (see code). That parser is what invokes the onopentag() callbacks. Because it works off a buffer that may already hold accumulated text, disconnecting from the read stream may not immediately stop the events that are still coming from the buffered data.

The parser itself has a public reset() method, and it appears that if you disconnect from the read stream and then call reset(), it should stop emitting events.
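To make the buffering point concrete, here is a minimal sketch that uses only Node's built-in stream module. The makeScanner() helper is hypothetical: it is a toy regex scanner standing in for a parser-backed writable stream, not htmlparser2's implementation. It shows that asking the stream to end from inside a callback does not interrupt the chunk that is already being scanned, and that a plain boolean guard (a generic technique, independent of any parser-specific method) is enough to ignore the later matches.

```javascript
const { Writable } = require("node:stream");

// Hypothetical stand-in for a parser-backed writable stream: it scans each
// chunk synchronously and calls onScript(src) for every <script src="...">.
function makeScanner(onScript) {
  return new Writable({
    write(chunk, _encoding, callback) {
      const pattern = /<script src="([^"]*)"/g;
      const text = chunk.toString();
      let match;
      // This loop runs to the end of the current chunk regardless of what
      // the callback does to the stream in the meantime.
      while ((match = pattern.exec(text)) !== null) {
        onScript(match[1]);
      }
      callback();
    },
  });
}

const html = '<script src="a.js"></script><script src="b.js"></script>';

// Without a guard: end() is requested on the first hit, yet the second
// tag in the same chunk is still reported.
const unguarded = [];
const first = makeScanner((src) => {
  unguarded.push(src);
  first.end();
});
first.write(html);

// With a guard: everything after the first match is ignored.
const guarded = [];
let done = false;
const second = makeScanner((src) => {
  if (done) return;
  done = true;
  guarded.push(src);
  second.end();
});
second.write(html);

console.log(unguarded); // [ 'a.js', 'b.js' ]
console.log(guarded);   // [ 'a.js' ]
```

The guard does not stop the scanning work itself; it only makes the extra callbacks harmless. Stopping the work early still requires something on the parser side.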

You can try this (I'm not a TypeScript person so you may have to massage some things to make the TypeScript compiler happy, but hopefully you can see the concept here):

  const got = require("got"); // assumes a CommonJS-compatible got release (v11 or earlier)
  const { WritableStream } = require("htmlparser2/lib/WritableStream");
  const scriptPath = await new Promise(function(resolve, reject) {
    try {
      const writeStream = new WritableStream({
        onopentag: (name, attrib) => {
          if (name === "script" && attrib.src) {
            console.log(`script : ${attrib.src}`);
            resolve(attrib.src);  // resolve with the first script src (later calls on a settled promise are ignored)
            // disconnect the readstream
            indexStream.unpipe(writeStream);
            // reset the internal parser so it clears any buffers it
            // may still be processing
            writeStream._parser.reset();
          }
        },
        onend() {
          resolve();
        }
      });
      const indexStream = got.stream("/index.html", {
        responseType: 'text',
        resolveBodyOnly: true
      });
      indexStream.pipe(writeStream); // and parse it
    } catch (e) {
      reject(e);
    }
  });