Optimise MongoDB aggregate query performance


I have the following DB structure:

Workspaces:

    Key              Index
    PK  id           id
        content

Projects:

    Key              Index
    PK  id           id
    FK  workspace    workspace_1
        deleted      deleted_1
        content

Items:

    Key              Index
    PK  id           id
    FK  project      project_1
        type         _type_1
        deleted      deleted_1
        content

I need to calculate the number of items of each type for each project in a workspace, e.g. expected output:

[
  { _id: 'projectId1', itemType1Count: 100, itemType2Count: 50, itemType3Count: 200 },
  { _id: 'projectId2', itemType1Count: 40, itemType2Count: 100, itemType3Count: 300 },
  ....
]

After a few attempts and some debugging I've created a query which produces the output I need:

const pipeline = [
    { $match: { workspace: 'workspaceId1' } },
    {
      $lookup: {
        from: 'items',
        let: { id: '$_id' },
        pipeline: [
          {
            $match: {
              $expr: {
                $eq: ['$project', '$$id'],
              },
            },
          },
          // Project only the fields needed by later stages, to avoid overloading
          // memory and hitting an `exceeded memory limit for $group` error.
          { $project: { _id: 1, type: 1, deleted: 1 } },
        ],
        as: 'items',
      },
    },
    // Use $unwind here to optimize aggregation pipeline, see:
    // https://stackoverflow.com/questions/45724785/aggregate-lookup-total-size-of-documents-in-matching-pipeline-exceeds-maximum-d
    // Without $unwind we may get a `matching pipeline exceeds maximum document size` error.
    // The error doesn't appear on every request, which makes it really strange and hard to debug.
    { $unwind: '$items' },
    { $match: { 'items.deleted': { $eq: false } } },
    {
      $group: {
        _id: '$_id',
        items: { $push: '$items' },
      },
    },
    {
      $project: {
        _id: 1,
        // Note: I have only 3 possible item types, so it's OK that their names are hardcoded.
        itemType1Count: {
          $size: {
            $filter: {
              input: '$items',
              cond: { $eq: ['$$this.type', 'type1'] },
            },
          },
        },
        itemType2Count: {
          $size: {
            $filter: {
              input: '$items',
              cond: { $eq: ['$$this.type', 'type2'] },
            },
          },
        },
        itemType3Count: {
          $size: {
            $filter: {
              input: '$items',
              cond: { $eq: ['$$this.type', 'type3'] },
            },
          },
        },
      },
    },
  ]

const counts = await Project.aggregate(pipeline)

The query works as expected, but it is very slow... With about 1000 items in one workspace it takes about 8 seconds to complete. Any ideas on how to make it faster are appreciated.

Thanks.

CodePudding user response:

Assuming your indexes are set up properly and contain the "correct" fields, there are still some tweaks we can make to the query itself.
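
For reference, here is a minimal sketch of indexes that should cover the queries below, assuming the field names from the question (the compound index on items is my assumption, not something stated in the original post):

// Assumption: a compound index serving the $lookup sub-pipeline: equality match on
// `project`, then the `deleted` filter, with `type` available for the later counting.
db.items.createIndex({ project: 1, deleted: 1, type: 1 })

// The outer $match selects projects by workspace, so a single-field index helps there.
db.projects.createIndex({ workspace: 1 })

Note that whether the $expr equality inside the $lookup sub-pipeline can actually use the items index depends on your MongoDB version (newer versions can, older ones cannot).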

Approach 1: keeping existing collection schema

db.projects.aggregate([
  {
    $match: {
      workspace: "workspaceId1"
    }
  },
  {
    $lookup: {
      from: "items",
      let: {id: "$_id"},
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                {$eq: ["$project","$$id"]},
                {$eq: ["$deleted",false]}
              ]
            }
          }
        },
        // Project only the fields needed by later stages, to avoid overloading
        // memory and hitting an `exceeded memory limit for $group` error.
        {
          $project: {
            _id: 1,
            type: 1,
            deleted: 1
          }
        }
      ],
      as: "items"
    }
  },
  // Use $unwind here to optimize aggregation pipeline, see:
  // https://stackoverflow.com/questions/45724785/aggregate-lookup-total-size-of-documents-in-matching-pipeline-exceeds-maximum-d
  // Without $unwind we may get a `matching pipeline exceeds maximum document size` error.
  // The error doesn't appear on every request, which makes it really strange and hard to debug.
  {
    $unwind: "$items"
  },
  {
    $group: {
      _id: "$_id",
      itemType1Count: {
        $sum: {
            "$cond": {
                "if": {$eq: ["$items.type","type1"]},
                "then": 1,
                "else": 0
            }
        }
      },
      itemType2Count: {
        $sum: {
            "$cond": {
                "if": {$eq: ["$items.type","type2"]},
                "then": 1,
                "else": 0
            }
        }
      },
      itemType3Count: {
        $sum: {
            "$cond": {
                "if": {$eq: ["$items.type","type1"]},
                "then": 1,
                "else": 0
            }
        }
      }
    }
  }
])

There are 2 major changes:

  1. moving the items.deleted: false condition into the $lookup sub-pipeline, so that fewer items documents are looked up
  2. skipping items: { $push: '$items' } and instead doing a conditional sum in the later $group stage

Here is the Mongo playground for your reference (at least to confirm the correctness of the new query).
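
If you want to verify that the indexes are actually being used, a quick check (a sketch in mongosh, not part of the original answer) is to run the aggregation through explain:

// Show the winning plan and execution stats for the Approach 1 pipeline above.
// The exact output shape varies between MongoDB versions.
db.projects.explain("executionStats").aggregate([ /* ...Approach 1 stages... */ ])

From Mongoose you can get the same information with Project.aggregate(pipeline).explain().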

Approach 2: modifying the collection schema

If the collection schema can be modified, we can denormalize projects.workspace into the items collection like this:

{
    "_id": "i1",
    "project": "p1",
    "workspace": "workspaceId1",
    "type": "type1",
    "deleted": false
}

In this way, you can skip the $lookup. A simple $match and $group will suffice.

db.items.aggregate([
  {
    $match: {
      "deleted": false,
      "workspace": "workspaceId1"
    }
  },
  {
    $group: {
      _id: "$project",
      itemType1Count: {
        $sum: {
          "$cond": {
            "if": {$eq: ["$type","type1"]},
            "then": 1,
            "else": 0
          }
        }
      },
      ...

Here is the Mongo playground with denormalized schema for your reference.
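
With the denormalized schema, a single compound index on items lets the $match stage be served by an index; this is a sketch based on the fields shown above, not something specified in the answer:

// Assumption: compound index matching the equality filters of the $match stage,
// with `project` and `type` included so the grouping fields are cheap to read.
db.items.createIndex({ workspace: 1, deleted: 1, project: 1, type: 1 })

Keep in mind that the denormalized workspace field has to be written whenever an item is created (and updated if an item ever moves to a project in a different workspace).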
