Node.js Large File Uploads blocking the Event Loop and Worker Pool

I want to upload large CSV files to a MongoDB cloud database from a Node.js server that uses Express, Mongoose, and Multer's GridFS storage engine. When a file upload starts, however, the server becomes unable to handle any other API requests. For example, if a different client requests a user from the database while the file is being uploaded, the server receives the request and tries to fetch the user from the MongoDB cloud, but the request gets stuck because the large file upload eats up all the computational resources. As a result, the client's GET request does not return the user until the in-progress file upload has completed.

I understand that a thread is considered "blocked" if it takes a long time to execute a callback (Event Loop) or a task (Worker), and that Node.js runs JavaScript code in the Event Loop while offering a Worker Pool to handle expensive tasks such as file I/O. I've read in this blog post on nodejs.org that, to keep a Node.js server speedy, the work associated with each client at any given time must be "small", and that my goal should be to minimize the variation in Task times. The reasoning behind this is that if a Worker's current Task is much more expensive than other Tasks, the Worker is unavailable for other pending Tasks, which effectively decreases the size of the Worker Pool by one until the Task is completed.
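To make the "blocked" idea concrete, here is a minimal sketch that uses only Node.js built-ins. The helper name `measureTimerDelay` is made up for illustration. It schedules a 0 ms timer (standing in for another client's request) and then keeps the Event Loop busy with synchronous work, so the timer cannot fire until that work returns. It only illustrates Event Loop blocking and is not a diagnosis of the upload problem itself.

```javascript
// Hypothetical helper: schedule a 0 ms timer, then block the Event Loop with
// synchronous work for `blockMs` and report how late the timer actually fired.
function measureTimerDelay(blockMs) {
  return new Promise((resolve) => {
    const start = Date.now();

    // Stands in for another client's request waiting for its callback to run.
    setTimeout(() => resolve(Date.now() - start), 0);

    // Synchronous busy loop: no other callback can run until it finishes.
    while (Date.now() - start < blockMs) {
      // intentionally empty
    }
  });
}

// Usage: the timer asked for 0 ms but has to wait for the busy loop.
measureTimerDelay(50).then((delayMs) => {
  console.log(`0 ms timer fired after ${delayMs} ms`);
});
```

The observed delay should be at least as long as the blocking work, which is the effect described above for a callback that takes too long.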

In other words, the client performing the large file upload is executing an expensive Task that decreases the throughput of the Worker Pool, which in turn decreases the throughput of the server. To work around this, I need to minimize variation in Task times by partitioning each Task into comparable-cost sub-Tasks. According to the aforementioned blog post, when each sub-Task completes it should submit the next sub-Task, and when the final sub-Task is done, it should notify the submitter. This way, between each sub-Task of the long Task (the large file upload), the Worker can work on a sub-Task from a shorter Task, thus solving the blocking problem.
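For reference, the partitioning pattern from the blog post can be sketched for plain JavaScript work roughly as follows. The helper `processInChunks` and its parameters are hypothetical names, not part of any library. Each sub-Task handles one slice of the input, submits the next sub-Task with `setImmediate`, and the final one resolves a promise to notify the submitter. How this maps onto a file upload handled by multer-gridfs-storage is a separate matter.

```javascript
// Hypothetical helper: process `items` in slices of `chunkSize`, yielding to
// the Event Loop between slices so that other pending callbacks can run.
function processInChunks(items, chunkSize, handleItem) {
  return new Promise((resolve, reject) => {
    let index = 0;

    function runNextChunk() {
      try {
        const end = Math.min(index + chunkSize, items.length);
        for (; index < end; index++) {
          handleItem(items[index], index);
        }
      } catch (err) {
        reject(err);
        return;
      }

      if (index < items.length) {
        // Submit the next sub-Task; other queued callbacks get a turn first.
        setImmediate(runNextChunk);
      } else {
        // Final sub-Task done: notify the submitter with the item count.
        resolve(index);
      }
    }

    runNextChunk();
  });
}

// Usage: sum 10,000 numbers in slices of 1,000.
let total = 0;
const numbers = Array.from({ length: 10000 }, (_, i) => i + 1);
processInChunks(numbers, 1000, (n) => { total += n; })
  .then((count) => console.log(`processed ${count} items, total = ${total}`));
```

The chunk size is the tuning knob here: smaller slices give other requests more chances to run, at the cost of more scheduling overhead.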

However, I do not know how to implement this solution in actual code. Are there any specific partitioned functions that can solve this problem? Do I have to use a specific upload architecture or a Node package other than multer-gridfs-storage to upload my files? Please help.

Here is my current file upload implementation using Multer's GridFS storage engine:

   // Adjust how files get stored.
   const storage = new GridFsStorage({
       // The DB connection
       db: globalConnection, 
       // The file's storage configurations.
       file: (req, file) => {
           ...
           // Return the file's data to the file property.
           return fileData;
       }
   });

   // Configure a strategy for uploading files.
   const datasetUpload = multer({ 
       // Set the storage strategy.
       storage: storage,

       // Set the size limits for uploading a file to 300MB.
       limits: { fileSize: 1024 * 1024 * 300 },
    
       // Set the file filter.
       fileFilter: fileFilter,
   });


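   // Upload a dataset file.
   router.post('/add/dataset', (req, res) => {
       // Begin the file upload.
       datasetUpload.single('file')(req, res, function (err) {
           if (err) {
               // Multer errors (e.g. size limit exceeded) are client errors;
               // anything else is treated as a server error.
               const status = err instanceof multer.MulterError ? 400 : 500;
               return res.status(status).send({ error: err.message });
           }
           // Get the parsed file from multer.
           const file = req.file;
           // Upload success.
           return res.status(200).send(file);
       });
   });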

CodePudding user response：

Can you change your architecture or infrastructure? If so, this challenge would be best solved by a different approach. It is actually a perfect candidate for a serverless solution such as AWS Lambda.

Lambda does not run requests in parallel on one machine. It assigns each request to one machine, and until that request is finished the machine will not receive any other traffic. Therefore you will never hit the limits you are encountering now.

CodePudding user response：

I think the problem comes from buffering: the buffer has to receive all the chunks before the entire buffer is sent to the consumer. Streams may solve this, because they let us process the data as soon as it arrives from the source, and they make possible things that cannot be done by buffering the data and processing it all at once. I found the storage.fromStream() method on the multer-gridfs-storage GitHub page and tested it by uploading a 122 MB file. It worked for me: the upload took less than 1 minute in total, and the server could easily respond to other requests during the upload.

const { GridFsStorage } = require('multer-gridfs-storage');
const multer = require('multer');
const express = require('express');
const fs = require('fs');
const connectDb = require('./connect');

const app = express();

// Multer first writes the incoming file to a temporary folder on disk.
const upload = multer({ dest: 'uploads/' });

// GridFS storage engine, used here only through fromStream().
const storage = new GridFsStorage({ db: connectDb() });

app.post('/profile', upload.single('file'), function (req, res, next) {
  const { file } = req;
  // Read the temporary file back as a stream instead of loading it into memory.
  const stream = fs.createReadStream(file.path);
  // Send the stream to GridFS; the data is saved as binary in the cloud DB.
  storage.fromStream(stream, req, file)
    .then(() => res.send('File uploaded'))
    .catch(() => res.status(500).send('error'));
});

// Simple route for checking that the server still responds during an upload.
app.get('/profile', (req, res) => {
  res.send('hello');
});

app.listen(5000);
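To illustrate the buffering versus streaming point without a database, here is a small self-contained sketch that uses only Node.js built-in streams. The names `fakeUploadChunks`, `createCountingSink`, and `streamUpload` are made up for this example. The generator stands in for an incoming file and the sink stands in for the GridFS write stream. Each chunk is handed to the sink as soon as it is produced, so the whole file is never held in memory at once.

```javascript
const { Readable, Writable } = require('stream');
const { pipeline } = require('stream/promises');

// Hypothetical source standing in for an uploaded file:
// yields `chunkCount` chunks of `chunkSize` bytes each.
function* fakeUploadChunks(chunkCount, chunkSize) {
  for (let i = 0; i < chunkCount; i++) {
    yield Buffer.alloc(chunkSize, 0x61);
  }
}

// Hypothetical sink standing in for a database write stream.
// It records what it receives instead of storing anything.
function createCountingSink(stats) {
  return new Writable({
    write(chunk, _encoding, callback) {
      stats.chunks += 1;
      stats.bytes += chunk.length;
      stats.largestChunk = Math.max(stats.largestChunk, chunk.length);
      // Complete asynchronously to mimic a non-blocking write and let
      // other callbacks run between chunks.
      setImmediate(callback);
    },
  });
}

// Stream the fake upload into the sink and report what the sink saw.
async function streamUpload(chunkCount, chunkSize) {
  const stats = { chunks: 0, bytes: 0, largestChunk: 0 };
  await pipeline(
    Readable.from(fakeUploadChunks(chunkCount, chunkSize)),
    createCountingSink(stats)
  );
  return stats;
}

// Usage: 8 chunks of 64 KiB each.
streamUpload(8, 64 * 1024).then((stats) => console.log(stats));
```

With `pipeline`, the source is only asked for more data when the sink is ready for it, which is why memory use stays near one chunk instead of the full file size.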