Home > Mobile >  MongoDB running numbers within subsets of documents
MongoDB running numbers within subsets of documents

Time:12-15

I have a existing collection of ~12 million documents. I want to update one field in all the documents to have a running number within all groups of documents that share a common "ref" field. This would be a one time operation. Is there any way I can achieve this in MongoDB 4.4?

Simplified documents example:

{"_id": 1, "ref": "REF_A", "description": "aaaa"}
{"_id": 2, "ref": "REF_A", "description": "bbbb"}
{"_id": 3, "ref": "REF_A", "description": "cccc"}
{"_id": 4, "ref": "REF_B", "description": "dddd"}
{"_id": 5, "ref": "REF_B", "description": "eeee"}
...

Desired modified output:

{"_id": 1, "ref": "REF_A", "description": "aaaa1"}
{"_id": 2, "ref": "REF_A", "description": "bbbb2"}
{"_id": 3, "ref": "REF_A", "description": "cccc3"}
{"_id": 4, "ref": "REF_B", "description": "dddd1"} <- reset count because ref changed
{"_id": 5, "ref": "REF_B", "description": "eeee2"}
...

The running number is concatenated to description field here. As soon as the "ref" changes, the concat number counter should reset and start from 1 again. When sorted by "_id" all the same refs are already together. Order matters.

I've been looking at aggregations to solve this, but it seems I would need a way to refer to previous documents and could not figure it out yet.

The best I could find is this thread: Add some kind of row number to a mongodb aggregate command / pipeline But does not seem to suit my case where row number is being reset under a condition.

CodePudding user response:

Query1

  • sort by ref and descript
  • group by ref and collect the docs for each ref-group
  • map to add the index of each member to the description
  • unwind replace root, to reset the initial structure
  • you need to add $out stage in the end of the pipeline(see your driver how to do this), to save this to a new collection, and then replace the one you have now. (we can't use group in any update method even with pipeline) (only way is $out or $merge, but merge will be slower)
  • set also {allowDiskUse : true}

Test code here

aggregate(
[{"$sort": {"ref": 1, "description": 1}},
  {"$group": {"_id": "$ref", "docs": {"$push": "$$ROOT"}}},
  {"$set": 
    {"docs": 
      {"$map": 
        {"input": {"$range": [0, {"$size": "$docs"}]},
          "in": 
          {"$let": 
            {"vars": {"doc": {"$arrayElemAt": ["$docs", "$$this"]}},
              "in": 
              {"$mergeObjects": 
                ["$$doc",
                  {"description": 
                    {"$concat": 
                      ["$$doc.description",
                        {"$toString": {"$add": ["$$this", 1]}}]}}]}}}}}}},
  {"$unwind": "$docs"},
  {"$replaceRoot": {"newRoot": "$docs"}}])

*In MongoDB 5 we have $setWindowFields for those, but you have MongoDB 4.4, so we only have $group i think, give it a try, but you have many documents.

Query2

  • requires >= MongoDB 5 which you don't have now
  • again you need out $out

Test code here

aggregate(
[{"$setWindowFields": 
    {"partitionBy": "$ref",
      "sortBy": {"description": 1},
      "output": {"rank": {"$rank": {}}}}},
  {"$set": 
    {"description": {"$concat": ["$description", {"$toString": "$rank"}]},
      "rank": "$$REMOVE"}}])
  • Related