Home > Blockchain >  Remove duplicate records from MongoDB
Remove duplicate records from MongoDB

Time:12-16

We have a MongoDB collection that has duplicate revisions for the same transaction.

Our ask is to cleanup this collection and retain only the latest revision of each transaction. I wrote a script that does remove the duplicates but for some records it does not keep the latest revision. The script is available here for reference: https://gist.github.com/praveenkrjha/1f02ed675e297a2c85e9020e39320a41

Initial record (tid is the transaction id and rv is the revision number):

_id: 61b994a48d546d882462da25, tid: b1def2a8-01e0-4f8c-a859-7cd63287617c, rv: 42 
_id: 61b994a48d546d882462da27, tid: b1def2a8-01e0-4f8c-a859-7cd63287617c, rv: 42 
_id: 61b994a48d546d882462da29, tid: b1def2a8-01e0-4f8c-a859-7cd63287617c, rv: 42 
_id: 61b994a48d546d882462da2b, tid: d9b9e197-6f91-480c-8757-2a24330bc4c5, rv: 21 
_id: 61b994a48d546d882462da2d, tid: d9b9e197-6f91-480c-8757-2a24330bc4c5, rv: 22 
_id: 61b994a48d546d882462da2f, tid: df0f1252-51d3-4471-be68-a81e8b9a11e2, rv: 31 
_id: 61b994a48d546d882462da31, tid: df0f1252-51d3-4471-be68-a81e8b9a11e2, rv: 32 
_id: 61b994a48d546d882462da33, tid: bc92c1ee-91b4-4eea-b9ad-752f8dba6c54, rv: 11 
_id: 61b994a48d546d882462da35, tid: bc92c1ee-91b4-4eea-b9ad-752f8dba6c54, rv: 12 
_id: 61b994a48d546d882462da37, tid: 2a131e9a-07f1-413f-a21f-4f1bbbb3d460, rv: 1 
_id: 61b994a48d546d882462da39, tid: e2c7fb60-a1e2-46ea-8d6a-e94e14ded2d3, rv: 1 
_id: 61b994a48d546d882462da3b, tid: 314f7cb2-1aad-4add-b2e1-17e910d97713, rv: 1

Final result after clean-up:

_id: 61b994a48d546d882462da25, tid: b1def2a8-01e0-4f8c-a859-7cd63287617c, rv: 42 
_id: 61b994a48d546d882462da2b, tid: d9b9e197-6f91-480c-8757-2a24330bc4c5, rv: 21 
_id: 61b994a48d546d882462da2f, tid: df0f1252-51d3-4471-be68-a81e8b9a11e2, rv: 31 
_id: 61b994a48d546d882462da33, tid: bc92c1ee-91b4-4eea-b9ad-752f8dba6c54, rv: 11 
_id: 61b994a48d546d882462da37, tid: 2a131e9a-07f1-413f-a21f-4f1bbbb3d460, rv: 1 
_id: 61b994a48d546d882462da39, tid: e2c7fb60-a1e2-46ea-8d6a-e94e14ded2d3, rv: 1 
_id: 61b994a48d546d882462da3b, tid: 314f7cb2-1aad-4add-b2e1-17e910d97713, rv: 1 

As you can see from final result, it deletes the latest revision for some transaction, which should not be done. Any pointers on how that can be achieved?

CodePudding user response:

Try this one:

db.collection.aggregate([
   { $sort: { rv: -1 } },
   { $group: { _id: "$tid", data: { $first: "$$ROOT" } } },
   { $replaceWith: "$data" }
])

or

db.collection.aggregate([
   { $sort: { rv: 1 } },
   { $group: { _id: "$tid", data: { $last: "$$ROOT" } } },
   { $replaceWith: "$data" }
])
  • Related