Home > Software engineering >  MongoDB count related documents (3 levels)
MongoDB count related documents (3 levels)

Time:12-24

I need to fast count related documents.

So, I have four collections

groups

{ "_id" : "g1", "name" : "group1" }
{ "_id" : "g2", "name" : "group2" }

courses

{ "_id" : "c1", "name" : "course1", "group_id" : "g1" }
{ "_id" : "c2", "name" : "course2", "group_id" : "g2" }

topics

{ "_id" : "t1", "name" : "top1c11", "course_id" : "c1" }
{ "_id" : "t2", "name" : "top1c12", "course_id" : "c1" }
{ "_id" : "t3", "name" : "top1c21", "course_id" : "c2" }

lessons

{ "_id" : "l1", "name" : "lesson111", "topic_id" : "t1" }
{ "_id" : "l2", "name" : "lesson112", "topic_id" : "t1" }
{ "_id" : "l3", "name" : "lesson121", "topic_id" : "t2" }
{ "_id" : "l4", "name" : "lesson211", "topic_id" : "t3" }

I need count all lessons of the specific group.

I tried to run the following aggregation, but I didn't wait for an response. (But it working for small amount of data)

db.getCollection('lessons').aggregate([
{
    "$lookup": {
        "from": "topics",
        "let": { "topicId": "$topic_id" },
        "pipeline": [
            { 
                "$match": { "$expr": { "$eq": [ "$_id", "$$topicId" ] } } 
            },
            {
                "$lookup": {
                    "from": "courses",
                    "let": { "courseId": "$topic_id" },
                    "pipeline": [
                        { "$match": { "$expr": { "$eq": [ "$course_id", "$$courseId" ] } } },
                    ],
                    "as": "course"
                },
            },
            {
                "$unwind": "$course"
            }

        ],
        "as": "topic"
    },
},
{
    "$unwind" : "$topic"
},
{
    "$match": {
        "topic.course.group_id" : "g1"
    }
},
{
    $group: {
        _id: "$course",
        "amount": {$sum:1},
    }
}
])

I believe this aggregation can be optimized. But I don`t sure that is a good approach to use aggregation framework for such purpose. If so, how can I optimize the aggregation.

Size of collections (test data):

  • courses: 30000
  • topics: 200000
  • lessons: 30000000

Now I use simple nested loops in my code to count lessons. This takes 10 seconds (for 3000 topics of the certain group).

CodePudding user response:

Query1

  • not nested lookups (lookup and unwind)
  • match the group
  • lookup and unwind 3x, last lookup only counts the lessons, and uses pipeline lookup
  • group by group _id, to get the total lessons

Indexes that you need (all the foreignField)

  • courses.group_id
  • topics.course_id
  • lessons.topic_id

Test code here

groups.aggregate(
[{"$match":{"_id":"g1"}},
 {"$lookup":
  {"from":"courses",
   "localField":"_id",
   "foreignField":"group_id",
   "as":"courses"}},
 {"$unwind":"$courses"},
 {"$lookup":
  {"from":"topics",
   "localField":"courses._id",
   "foreignField":"course_id",
   "as":"topics"}},
 {"$unwind":"$topics"},
 {"$lookup":
  {"from":"lessons",
   "localField":"topics._id",
   "foreignField":"topic_id",
   "pipeline":
   [{"$group":{"_id":null, "lessons":{"$sum":1}}},
    {"$set":{"id":"$_id", "_id":"$$REMOVE"}}],
   "as":"lessons"}},
 {"$set":
  {"lessons":
   {"$cond":
    [{"$eq":["$lessons", []]}, 0,
     {"$arrayElemAt":["$lessons.lessons", 0]}]}}},
 {"$group":{"_id":"$_id", "totalLessons":{"$sum":"$lessons"}}}])

Query2

  • nested lookups (without unwind)
  • code is the same, just nested

Test code here

groups.aggregate(
[{"$match":{"_id":"g1"}},
 {"$lookup":
  {"from":"courses",
   "localField":"_id",
   "foreignField":"group_id",
   "pipeline":
   [{"$lookup": 
     {"from":"topics",
      "localField":"_id",
      "foreignField":"course_id",
      "pipeline":
      [{"$lookup":
        {"from":"lessons",
         "localField":"_id",
         "foreignField":"topic_id",
         "pipeline":
         [{"$group":{"_id":null, "lessons":{"$sum":1}}},
          {"$set":{"id":"$_id", "_id":"$$REMOVE"}}],
         "as":"lessons"}},
       {"$set":
        {"lessons":
         {"$cond":
          [{"$eq":["$lessons", []]}, 0,
           {"$arrayElemAt":["$lessons.lessons", 0]}]}}}],
      "as":"topics"}},
    {"$project":
     {"_id":0, "totalLessons":{"$sum":"$topics.lessons"}}}],
   "as":"courses"}},
 {"$set":
  {"courses":"$$REMOVE",
   "totalLessons":
   {"$cond":
    [{"$eq":["$courses", []]}, 0,
     {"$arrayElemAt":["$courses.totalLessons", 0]}]}}}])

If you can send some feedback on which one was faster.
If for 1 group its very fast, maybe remove the match, to do it for all groups, or allow from match to pass more many groups.

CodePudding user response:

Solution from comment of Takis. Query1, adopted for 4.2

groups.aggregate(
[{"$match":{"_id":"g1"}},
 {"$lookup":
  {"from":"courses",
   "localField":"_id",
   "foreignField":"group_id",
   "as":"courses"}},
 {"$unwind":"$courses"},
 {"$lookup":
  {"from":"topics",
   "localField":"courses._id",
   "foreignField":"course_id",
   "as":"topics"}},
 {"$unwind":"$topics"},
 {"$lookup":
  {"from":"lessons",
   "pipeline":
   [{"$match":{"$expr":{"$eq":["$$ptopic", "$topic_id"]}}},
    {"$group":{"_id":null, "lessons":{"$sum":1}}},
    {"$set":{"id":"$_id", "_id":"$$REMOVE"}}],
   "as":"lessons",
   "let":{"ptopic":"$topics._id"}}},
 {"$set":
  {"lessons":
   {"$cond":
    [{"$eq":["$lessons", []]}, 0,
     {"$arrayElemAt":["$lessons.lessons", 0]}]}}},
 {"$group":{"_id":"$_id", "totalLessons":{"$sum":"$lessons"}}}])
  • Related