I have a existing collection of ~12 million documents. I want to update one field in all the documents to have a running number within all groups of documents that share a common "ref" field. This would be a one time operation. Is there any way I can achieve this in MongoDB 4.4?
Simplified documents example:
{"_id": 1, "ref": "REF_A", "description": "aaaa"}
{"_id": 2, "ref": "REF_A", "description": "bbbb"}
{"_id": 3, "ref": "REF_A", "description": "cccc"}
{"_id": 4, "ref": "REF_B", "description": "dddd"}
{"_id": 5, "ref": "REF_B", "description": "eeee"}
...
Desired modified output:
{"_id": 1, "ref": "REF_A", "description": "aaaa1"}
{"_id": 2, "ref": "REF_A", "description": "bbbb2"}
{"_id": 3, "ref": "REF_A", "description": "cccc3"}
{"_id": 4, "ref": "REF_B", "description": "dddd1"} <- reset count because ref changed
{"_id": 5, "ref": "REF_B", "description": "eeee2"}
...
The running number is concatenated to description field here. As soon as the "ref" changes, the concat number counter should reset and start from 1 again. When sorted by "_id" all the same refs are already together. Order matters.
I've been looking at aggregations to solve this, but it seems I would need a way to refer to previous documents and could not figure it out yet.
The best I could find is this thread: Add some kind of row number to a mongodb aggregate command / pipeline But does not seem to suit my case where row number is being reset under a condition.
CodePudding user response:
Query1
- sort by ref and descript
- group by ref and collect the docs for each ref-group
- map to add the index of each member to the description
- unwind replace root, to reset the initial structure
- you need to add
$out
stage in the end of the pipeline(see your driver how to do this), to save this to a new collection, and then replace the one you have now. (we can't use group in any update method even with pipeline) (only way is$out
or$merge
, but merge will be slower) - set also
{allowDiskUse : true}
aggregate(
[{"$sort": {"ref": 1, "description": 1}},
{"$group": {"_id": "$ref", "docs": {"$push": "$$ROOT"}}},
{"$set":
{"docs":
{"$map":
{"input": {"$range": [0, {"$size": "$docs"}]},
"in":
{"$let":
{"vars": {"doc": {"$arrayElemAt": ["$docs", "$$this"]}},
"in":
{"$mergeObjects":
["$$doc",
{"description":
{"$concat":
["$$doc.description",
{"$toString": {"$add": ["$$this", 1]}}]}}]}}}}}}},
{"$unwind": "$docs"},
{"$replaceRoot": {"newRoot": "$docs"}}])
*In MongoDB 5 we have $setWindowFields
for those, but you have MongoDB 4.4, so we only have $group
i think, give it a try, but you have many documents.
Query2
- requires >= MongoDB 5 which you don't have now
- again you need out
$out
aggregate(
[{"$setWindowFields":
{"partitionBy": "$ref",
"sortBy": {"description": 1},
"output": {"rank": {"$rank": {}}}}},
{"$set":
{"description": {"$concat": ["$description", {"$toString": "$rank"}]},
"rank": "$$REMOVE"}}])