What is the time complexity of fetching data from a table that is referenced in another table?


const mongoose = require("mongoose");

// Factory that builds the Mongoose schema for the image collection.
const image_schema = () => {
  const common_fields = {
    user_id: {
      type: mongoose.Schema.Types.ObjectId,
      ref: "user",
      required: true,
    },
    file_name: {
      type: String,
      required: true,
    },
  };
  return new mongoose.Schema(common_fields, {
    collection: `image`,
    timestamps: true,
  });
};

The above is the MongoDB (Mongoose) schema for the image collection.

Whenever I need to fetch a subset of rows in this table, I also need to get the corresponding user info from the user table that is referenced by the user_id column.
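
For context, assume the user schema looks something like this (the exact fields here are just illustrative):

const user_schema = () => {
  // Illustrative fields only.
  const common_fields = {
    user_name: {
      type: String,
      required: true,
    },
    email: {
      type: String,
      required: true,
    },
  };
  return new mongoose.Schema(common_fields, {
    collection: `user`,
    timestamps: true,
  });
};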

What is the time complexity of fetching the additional columns from the user table?

Would performance be significantly better if those additional columns from the user collection were included in the image collection, hence breaking normalization?

CodePudding user response:

Technically it's O(1) for embedded documents vs O(n) for referenced ones at query time on the MongoDB side, but there is also data transfer and Mongoose hydration, which are O(n) in both cases. Essentially it's the same O(n), just with a slightly steeper slope. Please read the details below.

Please note that Mongoose (v6 at the time of writing) doesn't use $lookup but a "more powerful alternative called populate()", and since it's Mongoose, most of the time is spent on the client unmarshalling BSON to JSON and then hydrating the JSON into Mongoose models.

Mongoose fetches refs in batches of 5000 documents by default, so if you query fewer than 5000 images, it issues one more query to fetch all the referenced users. Although it's technically O(n), the absolute values are quite small: if the users fit into the working set, it's a matter of milliseconds to query the data server-side. You will likely spend more time transferring the data from MongoDB to the client.
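
For example, a find with populate (the Image model name here is an assumption) results in two round trips: one for the images and one batched query for the referenced users:

// Model compiled from the schema in the question (name assumed).
const Image = mongoose.model("image", image_schema());

// Inside an async function:
// query 1 fetches the images, query 2 (issued by populate) fetches all
// distinct users referenced by the returned images, in one batch.
const images = await Image.find({ /* your filter */ })
  .limit(100)
  .populate("user_id");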

Converting BSON to JSON requires much more time. It's O(n), and n in this case is the number of fields × the number of objects. This is part of the MongoDB Node.js driver, and the only thing you can improve here is to project only the required fields.
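
A minimal sketch of such a projection, using the field names from the schema in the question (the Image model is assumed, as above):

// Project only the fields you actually need, so the driver has less BSON to convert.
const images = await Image.find({ /* your filter */ })
  .select("file_name user_id")
  .populate("user_id");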

The most expensive part is converting the JSON into Mongoose documents. Complexity is still O(n), but it's so time consuming that there is even a lean option to skip this step and return plain JSON for higher performance. So using:

.populate({
  path: 'user_id',
  select: <only required user's info>,
  options: { lean: true }
})

will make the overhead negligible. Please keep in mind the populated user's fields will be read-only (with lean they come back as plain objects, not Mongoose documents).
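
Put together, a full query might look like the sketch below (the user field names in select are assumptions, not from the question):

const images = await Image.find({ /* your filter */ })
  .select("file_name user_id")
  .populate({
    path: "user_id",
    select: "user_name email",  // assumed user fields
    options: { lean: true },    // skip hydration of the populated users
  })
  .lean();                      // optionally skip hydration of the images too

With lean on both levels you get plain objects end to end, which is usually enough for read-only responses.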

Data modification is a more important thing to consider than time complexity. Whilst denormalisation might give a measurable improvement in query speed, it opens a whole data-synchronisation can of worms: if you change the "corresponding user info" in the user table, it won't be automatically reflected in the user info stored in the "image" collection.

So there are a few things to consider if you denormalise data:

  • you will need to change the user update logic to update the information in all related collections (see the sketch after this list)
  • you may need to wrap that in a multi-document transaction to ensure data integrity
  • you will need to monitor changes made from outside of your app, e.g. manual changes with mongosh
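
A sketch of that update fan-out, assuming User and Image models, a denormalised user_info sub-document on images, and a deployment that supports transactions (replica set):

// Update the source of truth and every denormalised copy atomically.
const session = await mongoose.startSession();
await session.withTransaction(async () => {
  // The user collection remains the source of truth...
  await User.updateOne(
    { _id: user_id },
    { $set: { user_name: new_name } },
    { session }
  );
  // ...and every denormalised copy on the images must follow it.
  await Image.updateMany(
    { user_id: user_id },
    { $set: { "user_info.user_name": new_name } },
    { session }
  );
});
await session.endSession();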

CodePudding user response:

What is the time complexity of fetching the additional columns from the user table?

Well, for each image you need to perform an additional read, whether you fetch it using $lookup or in code after the initial fetch.
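
For reference, the $lookup flavour of that extra read might look like this sketch (the user collection name and the filter are assumptions):

const images = await Image.aggregate([
  { $match: { /* your filter */ } },
  {
    // Join each image with its referenced user on the server side.
    $lookup: {
      from: "user",          // collection name of the user model (assumed)
      localField: "user_id",
      foreignField: "_id",
      as: "user",
    },
  },
  { $unwind: "$user" },
]);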

So there is an obvious performance overhead to this approach (although in "real life" terms this difference is usually negligible). With that said, I personally still prefer the "normalization" approach in most cases.

There's a tradeoff between the two approaches: if your users are never updated and the additional storage used by the images collection is a non-issue, then maybe you could benefit from breaking "normalization". It really depends on your product's usage.
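
If you do break "normalization", a sketch of what the embedded snapshot could look like (field names are assumptions):

const denormalised_image_schema = new mongoose.Schema(
  {
    user_id: {
      type: mongoose.Schema.Types.ObjectId,
      ref: "user",
      required: true,
    },
    file_name: { type: String, required: true },
    // Snapshot of the owner's info, duplicated from the user collection
    // and kept in sync by your update logic.
    user_info: {
      user_name: String,
      email: String,
    },
  },
  { collection: "image", timestamps: true }
);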

There are many factors to consider before actually deciding on this; scale (of data) and actual performance needs are the top two factors, in my opinion.
