What is the fastest way to read & write tiny but MANY files using Nodejs?

Time: 11-25

I have a Node application which handles JSON files: it reads and parses files and writes new ones. Sometimes, out of necessity, the files become a massive swarm. The current reading speed looks reasonable to me, but the writing speed seems a little slow.

I'd like to improve this processing speed.

Before touching this program, I first tried multi-threading in a Python application of mine that does similar tasks but handles image files, and threading successfully reduced its response time.

I wonder whether it's okay to use Node's worker_threads to get the same effect, because the Node.js documentation says:

They do not help much with I/O-intensive work. The Node.js built-in asynchronous I/O operations are more efficient than Workers can be.

https://nodejs.org/api/worker_threads.html

The problem is that I don't know whether the current speed is the fastest the Node environment can deliver, or whether it is still improvable without worker_threads.

These are my attempts at improvement. My program reads and writes files one by one from a list of file paths, using the sync fs functions readFileSync() and writeFileSync(). First, I thought accessing many files synchronously was not node-ish, so I promisified the fs functions (readFile(), writeFile()), pushed the results into a list of promise objects, and then called await Promise.all(promisesList). But this didn't help at all; it was even slower.
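For reference, a simplified sketch of roughly what that first attempt looked like, using the built-in fs.promises API; processOne() and destFor() are placeholder names standing in for my actual transform and output-path logic:

  import { promises as fsp } from 'fs';

  // Placeholder: derive the output path from the input path.
  const destFor = (srcPath: string): string => srcPath.replace('.json', '.out.json');

  // Placeholder: read one JSON file, transform it, and write the result.
  async function processOne(srcPath: string, destPath: string): Promise<void> {
    const raw = await fsp.readFile(srcPath, 'utf8');
    const data = JSON.parse(raw);
    // ... actual transformation of `data` goes here ...
    await fsp.writeFile(destPath, JSON.stringify(data));
  }

  // Kick off every file at once and wait for all of them to finish.
  async function runAll(fileList: string[]): Promise<void> {
    await Promise.all(fileList.map((p) => processOne(p, destFor(p))));
  }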

For the second try, I gave up generating tons of promises and made a single promise. It kept watching the number of processed files and called resolve() when that number equalled the total number of files.

  // Poll until every file has been processed, re-scheduling the check
  // with setTimeout so the event loop is not blocked.
  const waiter = new Promise<boolean>((resolve, reject) => {
    const loop: () => void = () =>
      processedCount === fileLen ? resolve(true) : setTimeout(loop);
    loop();
  });

I awaited only this promise, and this was the slowest of all.

Now I think this shows that "asynchronous" does not mean "parallel". So, am I misunderstanding the documentation's explanation? Should I use worker_threads to improve file I/O speed in this case, or is there a better solution? Maybe the answer is not to use Node for this kind of processing at all; I'd love to, but sadly today is Nov 25th...

CodePudding user response:

The real bottleneck here will be the file system implementation. Running multiple threads to read and/or write files in parallel will give you some speedup, but you quickly hit that file system bottleneck.

As a general rule, typical file systems do not handle the use-case of "gazillions of tiny files" well. And it gets progressively worse if the files are on slow disks, a remote file system, a hierarchical storage system, and so on.

The best solution is to redesign your application so that it doesn't organize its data like that. Better alternatives involve combinations of the following (see the sketch after this list):

  • using an SQL or NoSQL database to store the data
  • using a flat-file database like SQLite or BDB
  • reading and writing TAR or ZIP archives
  • storing / buffering the data in memory.
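For example, one low-effort variant of the last two points is to consolidate the records into a single newline-delimited JSON (NDJSON) file instead of thousands of tiny files. A rough sketch; writeAll and readAll are illustrative names, not from your code:

import { createWriteStream, createReadStream } from 'fs';
import { createInterface } from 'readline';

// Append many small JSON records to one NDJSON file instead of writing one file per record.
async function writeAll(records: unknown[], outPath: string): Promise<void> {
  const out = createWriteStream(outPath);
  for (const rec of records) {
    out.write(JSON.stringify(rec) + '\n');
  }
  // Wait until the stream has flushed everything to disk.
  await new Promise<void>((resolve, reject) => {
    out.on('error', reject);
    out.end(resolve);
  });
}

// Read the records back one line at a time, without loading the whole file into memory.
async function readAll(inPath: string): Promise<unknown[]> {
  const records: unknown[] = [];
  const rl = createInterface({ input: createReadStream(inPath) });
  for await (const line of rl) {
    if (line.trim()) records.push(JSON.parse(line));
  }
  return records;
}

One big sequential file turns thousands of metadata operations (open, create, close) into a single open plus streaming writes, which is exactly the access pattern file systems are good at.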

If you are trying to get better performance, Python is not where you should look. For a start, a CPU-bound multi-threaded Python application is effectively constrained to a single core because of the GIL. And Python code is typically not compiled to native code.

A language like C, C++, or even Java would be a better choice. But parallelizing an application's file I/O is difficult, and the results tend to be disappointing. It is generally better to do it a different way; i.e. avoid application architectures that use lots of little files.

CodePudding user response:

Have you tried the Node streams API? There is also the JSONStream npm package for parsing JSON stream data. Please have a look.

const fs = require('fs');

// Stream the source file straight into the destination instead of
// buffering the whole file in memory.
let sourceFileStream = fs.createReadStream('./file1.json')
let destinationFileStream = fs.createWriteStream('./temp/file1.json')
sourceFileStream.pipe(destinationFileStream)
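
And if each file contains a large JSON array, a rough sketch of the usual JSONStream pattern (based on its documented parse() API) looks like this:

const fs = require('fs');
const JSONStream = require('JSONStream');

// Parse a large JSON array as a stream: '*' emits each top-level element
// as soon as it has been parsed, so the whole file never sits in memory.
fs.createReadStream('./file1.json')
  .pipe(JSONStream.parse('*'))
  .on('data', (item) => {
    // handle each parsed item here
  })
  .on('end', () => console.log('done'));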