optimize indexes in MongoDB


I have an Order collection with records that look like this:

{
    "_id": ObjectId,
    "status": String Enum,
    "products": [{
        "sku": String UUID,
        ...
    }, ...],
    ...
},

My goal is to find which products users buy together. Given an SKU, I would like to browse past orders and find, for orders that contain more than one product AND of course the product with the looked-up SKU, what other products were bought along with it.

So I created an aggregation pipeline that works:

[
    // exclude cancelled orders
  {
    '$match': {
      'status': {
        '$nin': [
          'CANCELLED', 'CHECK_OUT'
        ]
      }
    }
  },
// add fields with the product count and just the product skus
 {
    '$addFields': {
      'size': {
        '$size': '$products'
      }, 
      'skus': '$products.sku'
    }
  }, 
// limit to orders with 2 products or more including the looked up SKU
{
    '$match': {
      'size': {
        '$gte': 2
      }, 
      'skus': {
        '$elemMatch': {
          '$eq': '3516215049767'
        }
      }
    }
  }, 
// group by skus
{
    '$unwind': {
      'path': '$skus'
    }
  }, {
    '$group': {
      '_id': '$skus', 
      'count': {
        '$sum': 1
      }
    }
  }, 

// sort by count, exclude the looked up sku, limit to 4 results
{
    '$sort': {
      'count': -1
    }
  }, {
    '$match': {
      '_id': {
        '$ne': '3516215049767'
      }
    }
  }, {
    '$limit': 4
  }
]

Although this works, this collection contains more than 10K docs and I have an alert on my MongoDB instance telling me that the Scanned Objects / Returned ratio has gone above 1000.

So my question is: how can my query be improved, and what indexes can I add to support it?

db.Orders.stats();
{
  size: 14329835,
  count: 10571,
  avgObjSize: 1355,
  storageSize: 4952064,
  freeStorageSize: 307200,
  capped: false,
  nindexes: 2,
  indexBuilds: [],
  totalIndexSize: 466944,
  totalSize: 5419008,
  indexSizes: { _id_: 299008, status_1__created_at_1: 167936 },
  scaleFactor: 1,
  ok: 1,
  operationTime: Timestamp({ t: 1635415716, i: 1 })
}
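For reference, the ratio can be reproduced in mongosh with explain("executionStats"); the snippet below assumes the aggregation array above is stored in a variable named pipeline:

// Run the aggregation in explain mode, then compare
// totalDocsExamined to nReturned in the output.
const explained = db.Orders.explain("executionStats").aggregate(pipeline);
printjson(explained);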

CodePudding user response:

Let's start by rewriting the query a little to make it more efficient. Currently you're matching all the orders with a certain status and only then starting the data manipulation, which means every single stage is doing work on a larger-than-needed data set.

What we can do is move all the filtering into the first stage. This is made possible by Mongo's dot notation, like so:

{
    '$match': {
        'status': {
            '$nin': [
                'CANCELLED', 'CHECK_OUT',
            ],
        },
        'products.sku': '3516215049767', // mongo allows you to do this using the dot notation.
        'products.1': { $exists: true }, // this requires the array to have at least two elements.
    },
},
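For completeness, here is a sketch of the full rewritten pipeline. The $addFields/$size stages are gone; a $project keeps just the sku values, and the exclusion $match is moved ahead of the $sort (which Mongo's optimizer would do anyway):

[
  {
    // all filtering happens up front, so an index can cover it
    '$match': {
      'status': { '$nin': ['CANCELLED', 'CHECK_OUT'] },
      'products.sku': '3516215049767',
      'products.1': { '$exists': true }
    }
  },
  // keep only the sku values; no need for $addFields or $size anymore
  { '$project': { 'skus': '$products.sku' } },
  { '$unwind': '$skus' },
  { '$group': { '_id': '$skus', 'count': { '$sum': 1 } } },
  // drop the looked-up sku, then rank co-purchased products
  { '$match': { '_id': { '$ne': '3516215049767' } } },
  { '$sort': { 'count': -1 } },
  { '$limit': 4 }
]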

Now this achieves two things:

  1. We start the pipeline with only relevant results; there's no need to calculate the $size of the array anymore for many irrelevant documents. This alone will boost your performance greatly.
  2. Now we can create a compound index that supports this specific query. Before, we couldn't do that, as index usage is limited to the first stage, and that stage only included the status field. (As an aside, Mongo does optimize pipelines, but in this specific case no optimization was possible due to the usage of $addFields.)

The index that I recommend building is:

{ status: 1, "products.sku": 1 }

This will allow the best match to start off your pipeline.
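A minimal sketch of creating it and checking that it gets picked up (mongosh; pipeline stands for the rewritten aggregation above):

// build the compound index suggested above
db.Orders.createIndex({ status: 1, 'products.sku': 1 });

// verify: the winning plan's first stage should now be an IXSCAN on this index
db.Orders.explain('executionStats').aggregate(pipeline);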
