When a table contains a column that references the _id of another table, do you need to include the co

In MongoDB, let's suppose that I have a table called Skills, which has a column that references the primary key of Users.

The queries that involve fetching data from Skills will also need to fetch data about the Users.

Question:

Should the columns from Users that hold data that will be fetched be included in the Skills table, or should they simply be looked up from the Users table?

EDITED:

Which way would be faster? And if there is a difference, is it negligible?

CodePudding user response：

The best way to answer your question is to run a performance test on your own dataset and measure the difference. The first things to consider are the size of your dataset and how you will interact with the data (reading, updating, inserting, deleting).
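As a starting point for such a test, here is a minimal timing harness in Python. It is only a sketch: `measure`, `candidate_a` and `candidate_b` are hypothetical names, and the two workloads are placeholders so the snippet runs without a database. Against a real MongoDB instance you would swap in your actual queries.

```python
import statistics
import time


def measure(fn, repeats=20):
    """Run fn `repeats` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Placeholder workloads (assumptions, not real queries). With pymongo you would
# pass something like `lambda: list(db.skills.aggregate(pipeline))` instead,
# exhausting the cursor so the query actually runs to completion.
def candidate_a():
    return sum(range(1_000))


def candidate_b():
    return sum(range(10_000))


timings = {
    "candidate_a": measure(candidate_a),
    "candidate_b": measure(candidate_b),
}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1e6:.1f} us")
```

Taking the median over several runs reduces the effect of one-off spikes. Run it against a dataset of realistic size, since results on a handful of documents say little about production behaviour.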

In general, $lookup works much like SQL's JOIN (it performs a left outer join). On large datasets, however, it is best avoided where possible, because multi-collection operations can hurt performance.
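To make the $lookup approach concrete, here is a small Python sketch. The collection and field names (`skills`, `users`, `user_id`, `title`, `name`) are assumptions, since the question doesn't give a schema. The `pipeline` list is what you could pass to pymongo's `collection.aggregate`; the `emulate_lookup` helper is hypothetical and only mimics the left-outer-join result shape in memory for scalar fields, so the example runs without a server.

```python
# Hypothetical sample data; all field names are assumptions for illustration.
users = [
    {"_id": 1, "name": "Ann", "email": "ann@example.com"},
    {"_id": 2, "name": "Bob", "email": "bob@example.com"},
]
skills = [
    {"_id": 10, "title": "Driving", "user_id": 1},
    {"_id": 11, "title": "Cooking", "user_id": 2},
    {"_id": 12, "title": "Sailing", "user_id": 99},  # no matching user
]

# Pipeline you could pass to db.skills.aggregate(pipeline) with pymongo.
pipeline = [
    {
        "$lookup": {
            "from": "users",
            "localField": "user_id",
            "foreignField": "_id",
            "as": "user",
        }
    }
]


def emulate_lookup(local_docs, foreign_docs, stage):
    """In-memory stand-in for one equality $lookup stage on scalar fields."""
    spec = stage["$lookup"]
    joined = []
    for doc in local_docs:
        matches = [
            f for f in foreign_docs
            if f.get(spec["foreignField"]) == doc.get(spec["localField"])
        ]
        # Like a left outer join: every local document is kept,
        # and unmatched ones get an empty array.
        joined.append({**doc, spec["as"]: matches})
    return joined


result = emulate_lookup(skills, users, pipeline[0])
for row in result:
    print(row["title"], [u["name"] for u in row["user"]])
```

Note that each skill comes back with a `user` array rather than a single object, and a skill whose `user_id` matches nothing still appears, with an empty array.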

Given your users and skills, this looks like a many-to-many relationship; however, there are a few approaches you can investigate. It all depends on your data access patterns and how your application queries the data.

Two collections

Your data is normalized, but you rely on $lookup. This can be beneficial if you often query users without skills or skills without users. It also makes updates easier compared to the other approaches.

Collection of users with an array of skills

This one seems interesting, as it's a one-to-few rather than a one-to-many relationship. It lets you retrieve a user with the corresponding skills in the fastest way (no $lookup, just a query by _id). It also lets you query all users based on their skills (skills can be indexed and all array queries can be applied). If you want to update a skill's name or any other attribute, you need to run an update statement that affects multiple documents, because your data is denormalized. That can be considered a drawback, but you are the one who knows whether such a scenario will happen frequently. Any other aggregation on skills is also possible.
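A rough Python sketch of this shape follows. The field names (`skills`, `name`, `level`) are assumptions. The `by_skill`, `rename` and `filters` values are what you could hand to pymongo's `find` and `update_many`; the two helper functions are hypothetical in-memory stand-ins so the snippet runs without a database.

```python
# Hypothetical users collection with embedded skills; names are assumptions.
users = [
    {"_id": 1, "name": "Ann", "skills": [{"name": "Driving", "level": 3}]},
    {"_id": 2, "name": "Bob", "skills": [
        {"name": "Driving", "level": 1},
        {"name": "Cooking", "level": 2},
    ]},
    {"_id": 3, "name": "Cy", "skills": []},
]

# With pymongo these would be used roughly as:
#   db.users.find_one({"_id": 1})    # user plus skills, no $lookup
#   db.users.find(by_skill)          # can use an index on "skills.name"
#   db.users.update_many(by_skill, rename, array_filters=filters)
by_skill = {"skills.name": "Driving"}
rename = {"$set": {"skills.$[s].name": "Driving licence"}}
filters = [{"s.name": "Driving"}]


def find_by_skill(docs, skill_name):
    """In-memory stand-in for find({"skills.name": skill_name})."""
    return [d for d in docs if any(s["name"] == skill_name for s in d["skills"])]


def rename_skill(docs, old, new):
    """In-memory stand-in for the update_many above; returns documents touched."""
    touched = 0
    for d in docs:
        hit = False
        for s in d["skills"]:
            if s["name"] == old:
                s["name"] = new
                hit = True
        touched += hit
    return touched


drivers = [d["name"] for d in find_by_skill(users, "Driving")]
touched = rename_skill(users, "Driving", "Driving licence")
print(drivers, touched)
```

The rename has to touch every user document that carries the skill, which is the denormalization cost described above. Reads by _id or by skill stay single-collection queries.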

Collection of skills with an array of users

This scenario looks like one-to-many rather than one-to-few. That means your documents can become gigantic as your system grows (imagine how many users will have a driving license skill, etc.), and a single MongoDB document is capped at 16 MB. The other drawback is that it's hard to retrieve a user's data when it's copied across multiple documents. The same goes for updating user data.

Two separate collections: skills and users with embedded data subset

You can also consider keeping skills with minimal user information (only the attributes that your query needs to retrieve) alongside a second users collection, or vice versa. In this case you are duplicating data between two collections, which makes updates troublesome; however, it can be optimal from a query performance perspective.
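Here is a minimal Python sketch of the embedded-subset idea, assuming each skill document carries a small copy of its user (`_id` and `name` only). All names are hypothetical, and the code works on in-memory data so it runs without a server; the comments show the pymongo calls it stands in for.

```python
# Full user records live in users (keyed by _id here for brevity).
users = {
    1: {"_id": 1, "name": "Ann", "email": "ann@example.com", "city": "Oslo"},
}
# Each skill embeds only the user attributes its queries need.
skills = [
    {"_id": 10, "title": "Driving", "user": {"_id": 1, "name": "Ann"}},
    {"_id": 11, "title": "Cooking", "user": {"_id": 1, "name": "Ann"}},
]


def rename_user(user_id, new_name):
    """Update the source of truth, then every embedded copy.

    Stands in for, with pymongo:
      db.users.update_one({"_id": user_id}, {"$set": {"name": new_name}})
      db.skills.update_many({"user._id": user_id}, {"$set": {"user.name": new_name}})
    Returns the number of embedded copies that were fixed.
    """
    users[user_id]["name"] = new_name
    fixed = 0
    for s in skills:
        if s["user"]["_id"] == user_id:
            s["user"]["name"] = new_name
            fixed += 1
    return fixed


# Read path: one pass over skills yields the user's name without touching users.
names_before = [s["user"]["name"] for s in skills]
fixed = rename_user(1, "Anna")
names_after = [s["user"]["name"] for s in skills]
print(names_before, fixed, names_after)
```

Reads need only the skills collection, but every change to a copied attribute becomes two writes. In a real database those writes are not atomic unless you wrap them in a transaction, so this pattern fits best when the copied attributes rarely change.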

As you can see, data modelling always comes with some drawbacks. You need to understand your data access patterns to make the right choice. There are at least four different possibilities, and I would encourage you to try all of them and measure both the performance and the query/update complexity. That should give you enough input to make the decision.

CodePudding user response：

The columns should be looked up in the Users table. Copying them into the Skills table would violate normalization rules.