Node.js Large File Uploads to MongoDB blocking the Event Loop and Worker Pool

Time:05-13

I want to upload large CSV files to a MongoDB cloud database from a Node.js server that uses Express, Mongoose and Multer's GridFS storage engine, but once a file upload starts, my database becomes unable to handle any other API requests. For example, if a different client requests a user from the database while the file is being uploaded, the server receives the request and tries to fetch the user from the MongoDB cloud, but the request gets stuck because the large file upload eats up all the computational resources. As a result, the client's GET request does not return the user until the upload in progress has completed.

I understand that if a thread takes a long time to execute a callback (Event Loop) or a task (Worker), it is considered "blocked", and that Node.js runs JavaScript code in the Event Loop while offering a Worker Pool to handle expensive tasks like file I/O. I've read in this blog post on nodejs.org that, in order to keep a Node.js server speedy, the work associated with each client at any given time must be "small", and that my goal should be to minimize the variation in Task times. The reasoning behind this is that if a Worker's current Task is much more expensive than other Tasks, it will be unavailable to work on other pending Tasks, thus decreasing the size of the Worker Pool by one until the Task is completed.

In other words, the client performing the large file upload is executing an expensive Task that decreases the throughput of the Worker Pool, in turn decreasing the throughput of the server. According to the aforementioned blog post, when each sub-Task completes, it should submit the next sub-Task, and when the final sub-Task is done, it should notify the submitter. This way, between each sub-Task of the long Task (the large file upload), the Worker can work on a sub-Task from a shorter Task, thus solving the blocking problem.

However, I do not know how to implement this solution in actual code. Are there any specific partitioned functions that can solve this problem? Do I have to use a specific upload architecture or a Node package other than multer-gridfs-storage to upload my files? Please help.

Here is my current file upload implementation using Multer's GridFS storage engine:

   // Adjust how files get stored.
   const storage = new GridFsStorage({
       // The DB connection
       db: globalConnection, 
       // The file's storage configurations.
       file: (req, file) => {
           ...
           // Return the file's data to the file property.
           return fileData;
       }
   });

   // Configure a strategy for uploading files.
   const datasetUpload = multer({ 
       // Set the storage strategy.
       storage: storage,

       // Set the size limits for uploading a file to 300MB.
       limits: { fileSize: 1024 * 1024 * 300 },
    
       // Set the file filter.
       fileFilter: fileFilter,
   });


   // Upload a dataset file.
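   router.post('/add/dataset', (req, res) => {
       // Begin the file upload.
       datasetUpload.single('file')(req, res, function (err) {
           // Report upload failures (e.g. size limit exceeded) instead of ignoring them.
           if (err) {
               return res.status(500).send({ error: err.message });
           }
           // Get the parsed file from multer.
           const file = req.file;
           // Upload success.
           return res.status(200).send(file);
       });
   });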

CodePudding user response:

I think the problem comes from buffering. When the upload is buffered, the buffer has to receive every chunk before the whole thing is handed to the consumer, so the transfer takes a long time. Streams solve this: they let us process the data as soon as it arrives from the source, which is not possible when the data is buffered and processed all at once. I found the storage.fromStream() method on the multer-gridfs-storage GitHub page and tested it by uploading a 122 MB file. It worked for me: thanks to Node.js streams, every chunk of data is consumed and saved to the cloud database as soon as it is received. The whole upload took less than 1 minute, and the server could easily respond to other requests during the upload.

const {GridFsStorage} = require('multer-gridfs-storage');
const multer = require('multer');
const upload = multer({ dest: 'uploads/' });
const express = require('express');
const fs = require('fs');
const connectDb = require('./connect');
const app = express();
 
const storage = new GridFsStorage({db:connectDb()});

app.post('/profile', upload.single('file'), function (req, res, next) {
  const { file } = req;
  // Create a read stream over the temporary file that multer wrote to disk.
  const stream = fs.createReadStream(file.path);
  // Save the data as binary to the cloud DB, chunk by chunk.
  storage.fromStream(stream, req, file)
    .then(() => res.send('File uploaded'))
    .catch(() => res.status(500).send('error'));
});

app.get('/profile', (req, res) => {
  res.send('hello');
});

app.listen(5000);
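
To make the difference between buffering and streaming concrete, here is a minimal, self-contained sketch that uses only Node.js core streams. fakeUpload and createSink are hypothetical stand-ins for the incoming file and the database (they are not part of multer or multer-gridfs-storage). The point is only that each chunk is handed to the sink as soon as it is produced, and that the event loop gets a turn between chunks.

```javascript
const { Readable, Writable } = require('stream');
const { pipeline } = require('stream/promises');

// Hypothetical stand-in for an upload source:
// yields `total` bytes in pieces of at most `chunkSize` bytes.
function* fakeUpload(total, chunkSize) {
  for (let sent = 0; sent < total; sent += chunkSize) {
    yield Buffer.alloc(Math.min(chunkSize, total - sent));
  }
}

// Hypothetical stand-in for the database:
// records the size of each chunk as it arrives.
function createSink(seen) {
  return new Writable({
    write(chunk, _encoding, callback) {
      seen.push(chunk.length);
      // Simulate an asynchronous write and let other callbacks run
      // before the next chunk is accepted.
      setImmediate(callback);
    },
  });
}

// Pipes the fake upload into the sink and resolves with the chunk sizes seen.
async function streamUpload(total, chunkSize) {
  const seen = [];
  await pipeline(Readable.from(fakeUpload(total, chunkSize)), createSink(seen));
  return seen;
}
```

Calling streamUpload(10, 4) resolves with the sizes of the chunks the sink received, which add up to 10 bytes. At no point is the whole payload held in a single buffer.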

CodePudding user response:

So after a couple of days of research I've found out that the root of the problem wasn't Node.js or my file upload implementation. The problem was that I was using the "Serverless pricing package" of MongoDB Atlas, which couldn't handle the file upload workload at the same time as other operations such as fetching users from my database. As it turns out, Node.js was receiving API calls from other clients as it should, but they weren't returning any results because they were getting stuck at the DB level. Once I upgraded to an M10 dedicated cluster with 2 vCPUs, the problem was resolved.
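
One way to tell these two situations apart (Node.js itself being blocked versus requests waiting on the database) is to measure event-loop lag while an upload is running. The sketch below is an assumption about how one might check this, not what was actually done here. measureLoopLag and busyWait are hypothetical helpers built only on setTimeout and process.hrtime.bigint(). If the reported lag stays near zero while requests still hang, the event loop is free and the wait is happening elsewhere, for example in the database.

```javascript
// Hypothetical helper: resolves with how many milliseconds late a timer fired.
// A timer can only fire late if the event loop was busy with something else.
function measureLoopLag(intervalMs = 10) {
  return new Promise((resolve) => {
    const start = process.hrtime.bigint();
    setTimeout(() => {
      const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
      resolve(Math.max(0, elapsedMs - intervalMs));
    }, intervalMs);
  });
}

// Hypothetical helper that deliberately blocks the event loop, for comparison.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // Spin: nothing else can run on the event loop during this time.
  }
}
```

On an idle process, measureLoopLag() resolves with a small value. Calling busyWait(60) right after starting a measurement delays the timer, so the reported lag rises to several tens of milliseconds.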

According to this blog post about MongoDB Best Practices, the total number of active threads (i.e., concurrent operations) relative to the number of CPUs can impact database performance, and therefore the throughput of the Node.js server.
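
If upgrading the cluster is not an option, a related idea is to cap how many database operations the server keeps in flight at once, so that a bulk upload cannot occupy every available thread on a small cluster. The following is only a sketch of that idea under stated assumptions, not something taken from the blog post or verified against Atlas. createLimiter is a hypothetical helper and is not part of Mongoose or the MongoDB driver.

```javascript
// Hypothetical helper: returns a function that runs async tasks,
// never allowing more than `maxConcurrent` of them to run at once.
function createLimiter(maxConcurrent) {
  let active = 0;
  const queue = [];

  function runNext() {
    // Do nothing if we are at capacity or there is no pending work.
    if (active >= maxConcurrent || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        // Free the slot and start the next queued task, if any.
        active--;
        runNext();
      });
  }

  return function limit(task) {
    return new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      runNext();
    });
  };
}
```

For example, with const limit = createLimiter(2), each database call would be wrapped as limit(() => someAsyncDbCall()), where someAsyncDbCall is a placeholder for any function that returns a promise. A third call then waits until one of the first two has settled.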

CodePudding user response:

Can you manage the architecture/infrastructure? If so, this challenge would be best solved by a different approach. This is actually a perfect candidate for a serverless solution, i.e. Lambda.

Lambda does not run multiple requests in parallel on one machine. It assigns one request to one machine, and until that request is finished, the machine will not receive any other traffic. Therefore you will never hit the limits you are encountering now.
