Home > Software design >  How to merge two collections keeping the document with highest timestamp in MongoDB
How to merge two collections keeping the document with highest timestamp in MongoDB

Time:09-27

I'm creating a MongoDB client for a Go Application, using the MongoDB Go Driver. In particular, I have two databases with one collection each. These collection can be modified asynchronously by different clients, so i need to periodically synchronize them, keeping the most recently edited document, among those with the same id field

The two databases are stored on different hosts, so i need to export the collection from one host using mongoexport and import into the other host using mongoimport.

I already tried using mongoimport --collection=myColl --mode=merge, but this doesn't fit my goal because simply overrides the conflicting documents from myColl with the imported ones.

My idea is to import the json into a temp collection, but i don't know how to compare the timestamps during the aggregation/merge process.

My collections are structured Like this, any idea?

Collection 1

{"_id":"K1","value":"VAL1","timest":{"$date":"2021-09-26T09:05:09.942Z"}}
{"_id":"K2","value":"VAL2","timest":{"$date":"2021-09-26T09:05:10.234Z"}}

Collection 2

{"_id":"K2","value":"VAL3","timest":{"$date":"2021-09-26T09:15:09.942Z"}}
{"_id":"K3","value":"VAL4","timest":{"$date":"2021-09-26T09:15:10.234Z"}}

Desired Behaviour

Conflict

    {"_id":"K2","value":"VAL2","timest":{"$date":"2021-09-26T09:05:10.234Z"}}
{"_id":"K2","value":"VAL3","timest":{"$date":"2021-09-26T09:15:09.942Z"}}[LATEST]

Output

{"_id":"K1","value":"VAL1","timest":{"$date":"2021-09-26T09:05:09.942Z"}}
{"_id":"K2","value":"VAL3","timest":{"$date":"2021-09-26T09:15:09.942Z"}}
{"_id":"K3","value":"VAL4","timest":{"$date":"2021-09-26T09:15:10.234Z"}}

CodePudding user response:

You can use $merge

The bellow merges testdb1.coll to testdb2.coll based on same _id And keeps the document with the latest date. If _id is not found, then document is inserted.

Data in

testdb1.coll

[{"_id" "K2","value" "VAL3","timest" (date "2021-09-26T09:15:09.942Z")}  
 {"_id" "K3","value" "VAL4","timest" (date "2021-09-26T09:15:10.234Z")}]

testdb2.coll

[{"_id" "K1","value" "VAL1","timest" (date "2021-09-26T09:05:09.942Z")}
{"_id" "K2","value" "VAL2","timest" (date "2021-09-26T09:05:10.234Z")}]

Results

testdb2.coll (after the merge)

{"_id": "K1", "value": "VAL1", "timest": {"$toDate": "2021-09-26T09:05:09.942Z"}}
{"_id": "K2", "value": "VAL3", "timest": {"$toDate": "2021-09-26T09:15:09.942Z"}}
{"_id": "K3", "value": "VAL4", "timest": {"$toDate": "2021-09-26T09:15:10.234Z"}}

Query
(instead of $let you could use $$new)

client.db("testdb1").collection("coll").aggregate(
[
    {
      "$merge": {
        "into": {
          "db": "testdb2",
          "coll": "coll"
        },
        "on": [
          "_id"
        ],
        "let": {
          "p_ROOT": "$$ROOT"
        },
        "whenMatched": [
          {
            "$replaceRoot": {
              "newRoot": {
                "$cond": [
                  {
                    "$gt": [
                      "$$p_ROOT.timest",
                      "$timest"
                    ]
                  },
                  "$$p_ROOT",
                  "$$ROOT"
                ]
              }
            }
          }
        ],
        "whenNotMatched": "insert"
      }
    }
  ])

CodePudding user response:

You can do following in an aggregation pipeline:

  1. use $unionWith to combine the 2 collections
  2. $sort to order them by timest
  3. use $first to get the latest document
  4. use $replaceRoot to get the final form your want

Here is the Mongo playground for your reference.

  • Related