I'm trying to get a list of current holders at specific times from a collection. My collection looks like this:
[
{
"time": 1,
"holdings": [
{ "owner": "A", "tokens": 2 },
{ "owner": "B", "tokens": 1 }
]
},
{
"time": 2,
"holdings": [
{ "owner": "B", "tokens": 2 }
]
},
{
"time": 3,
"holdings": [
{ "owner": "A", "tokens": 3 },
{ "owner": "B", "tokens": 1 },
{ "owner": "C", "tokens": 1 }
]
},
{
"time": 4,
"holdings": [
{ "owner": "C", "tokens": 0 }
]
}
]
tokens shows the current holdings of an owner only if the holdings have changed compared to the last document. I would like to change the collection so that holdings always includes the full current holdings for any point in time.
At time: 1, the holdings are: A: 2, B: 1.
At time: 2, the holdings are: A: 2, B: 2. The collection does not include A's holdings, however, because they haven't changed. So what I'd like to get is:
[
{
"time": 1,
"holdings": [
{ "owner": "A", "tokens": 2 },
{ "owner": "B", "tokens": 1 }
]
},
{
"time": 2,
"holdings": [
{ "owner": "A", "tokens": 2 }, // merged from prev doc.
{ "owner": "B", "tokens": 2 }
]
},
{
"time": 3,
"holdings": [
{ "owner": "A", "tokens": 3 },
{ "owner": "B", "tokens": 1 },
{ "owner": "C", "tokens": 1 }
]
},
{
"time": 4,
"holdings": [
{ "owner": "A", "tokens": 3 }, // merged from prev
{ "owner": "B", "tokens": 1 }, // merged from prev
{ "owner": "C", "tokens": 0 }
]
}
]
From what I understand, $mergeObjects does that, but I don't understand how I can merge all previous docs in order up to the current doc, for each doc. So I think I'm looking for a way to combine $setWindowFields with $mergeObjects.
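To make it concrete, the kind of thing I mean is collecting the previous holdings with a window function and then merging them per owner somehow (just a sketch of the idea, not working code; previousHoldings is a made-up name, and the merge step is exactly the part I don't know how to do):
db.collection.aggregate([
  {
    $setWindowFields: {
      sortBy: { time: 1 },
      output: {
        // all holdings arrays from the first document up to the current one
        previousHoldings: {
          $push: "$holdings",
          window: { documents: ["unbounded", "current"] }
        }
      }
    }
  }
  // ... and then somehow $mergeObjects previousHoldings per owner
])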
CodePudding user response:
This is a nice challenge.
So far, I got this complicated solution:
- Get all of the timestamps that appear in any document. This is the purpose of the first 4 steps; $setWindowFields is used to accumulate this data.
- $group by owner and work out the timestamps that are missing for that owner as wantedTimes - next 5 steps.
- $set the missing timestamps with tokens: null, to be filled with actual data later, and $unwind to separate - next 3 steps.
- Use $setWindowFields to find the last known token count for each owner at each timestamp.
- Fill in this last known state for the entries with unknown tokens - 2 steps.
- $group and format the answer:
db.collection.aggregate([
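  // first 4 stages: gather every timestamp in the collection into allTimes on each document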
{
$setWindowFields: {
sortBy: {time: 1},
output: {
        allTimes: {
          $addToSet: "$time",
          window: {documents: ["unbounded", "current"]}
        }
}
}
},
{
$setWindowFields: {
sortBy: {time: -1},
output: {
        allTimes: {
          $addToSet: "$allTimes",
          window: {documents: ["unbounded", "current"]}
        }
}
}
},
{
$set: {
allTimes: {
$reduce: {
input: "$allTimes",
initialValue: [],
in: {"$concatArrays": ["$$value", "$$this"]}
}
}
}
},
{$set: {allTimes: {$setIntersection: "$allTimes"}}},
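  // next 5 stages: per owner, find the timestamps at or after their first appearance that have no entry (wantedTimes)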
{$unwind: "$holdings"},
{$sort: {time: 1}},
{$group: { _id: "$holdings.owner",
tokens: {$push: {tokens: "$holdings.tokens", time: "$time"}},
times: {$push: "$time"}, firstTime: {$first: "$time"},
allTimes: {$first: "$allTimes"}}
},
{
$addFields: {
wantedTimes: {
$filter: {
input: "$allTimes",
as: "item",
cond: {$gte: ["$$item", "$firstTime"]}
}
}
}
},
{
$project: {
tokens: 1,
wantedTimes: {$setDifference: ["$wantedTimes", "$times"]}
}
},
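  // next 3 stages: add placeholder entries with tokens: null for those timestamps and flatten to one doc per (owner, time)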
{
$set: {
data: {
$map: {
input: "$wantedTimes",
as: "item",
in: {time: "$$item", tokens: null}
}
}
}
},
{$project: {tokens: {"$concatArrays": ["$tokens", "$data"]}}},
{$unwind: "$tokens"},
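  // per owner, collect the running list of token values up to each timestamp (nulls included)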
{
$setWindowFields: {
partitionBy: "$_id",
sortBy: {"tokens.time": 1},
output: {
lastTokens: {
$push: "$tokens.tokens",
window: {documents: ["unbounded", "current"]}
}
}
}
},
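  // 2 stages: drop the nulls from that running list and use its last element to fill the placeholders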
{
$set: {
lastTokens: {
$filter: {
input: "$lastTokens",
as: "item",
cond: {$ne: ["$$item", null]}
}
}
}
},
{
$set: {
"tokens.tokens": {$ifNull: ["$tokens.tokens", {$last: "$lastTokens"}]}
}
},
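  // rebuild one document per timestamp and format the output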
{
$group: {
_id: "$tokens.time",
holdings: {$push: {owner: "$_id", tokens: "$tokens.tokens" }}
}
},
{$project: {time: "$_id", holdings: 1, _id: 0}},
{$sort: {time: 1}}
])
CodePudding user response:
From a performance perspective I recommend you split this into 2 calls: the first is a quick findOne just to get the maximum time value in the collection. Once you have that value the pipeline can be much leaner:
const maxItem = await db.collection.findOne({}, { sort: { time: -1 } }); // document with the highest time
db.collection.aggregate([
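  // one document per (time, owner) entry that actually changed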
{
$unwind: "$holdings"
},
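  // per owner: all recorded (time, tokens) changes plus the owner's first (minimum) time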
{
$group: {
_id: "$holdings.owner",
times: {
$push: {
time: "$time",
tokens: "$holdings.tokens"
}
},
minTime: {
$min: "$time"
}
}
},
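  // walk every time from the owner's first appearance up to the max time,
  // carrying the last known entry forward when nothing changed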
{
$addFields: {
times: {
$reduce: {
input: {
$range: [
"$minTime",
            maxItem.time + 1 // this is the max time; +1 because $range excludes the upper bound
]
},
initialValue: {
values: [],
lastIndex: 0
},
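          // at each time step, append either the next recorded entry (when the
          // owner changed at this time) or a copy of the previous entry with the
          // time overwritten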
in: {
values: {
"$concatArrays": [
"$$value.values",
[
{
$cond: [
{
$in: [
"$$this",
"$times.time"
]
},
{
"$arrayElemAt": [
"$times",
"$$value.lastIndex"
]
},
{
"$mergeObjects": [
{
tokens: 0
},
{
"$arrayElemAt": [
"$times",
{
$subtract: [
"$$value.lastIndex",
1
]
}
]
},
{
time: "$$this"
}
]
}
]
}
]
]
},
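            // advance lastIndex only when this time was an actual change for this owner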
lastIndex: {
$cond: [
{
$in: [
"$$this",
"$times.time"
]
},
{
$sum: [
"$$value.lastIndex",
1
]
},
"$$value.lastIndex"
]
}
}
}
}
}
},
{
$unwind: "$times.values"
},
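  // rebuild one document per timestamp with the full holdings list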
{
$group: {
_id: "$times.values.time",
holdings: {
$push: {
owner: "$_id",
tokens: "$times.values.tokens"
}
}
}
},
{
$project: {
_id: 0,
time: "$_id",
holdings: 1
}
},
{
$sort: {
time: 1
}
}
])
This is still quite a heavy query as it requires $unwind-ing and $group-ing the entire collection, but there is no way around that given the requirements. If the collection is too big for this approach, I recommend iterating owner by owner, or time by time, and doing separate updates accordingly (sketched below).
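For example, a rough sketch of the time-by-time variant, assuming a Node.js driver connection where db.collection is the collection handle (the carry map, the batch size and the bulkWrite batching are my own choices here):
// walk the documents in time order, carrying the last known tokens per owner,
// and rewrite each document's holdings with the full current state
const carry = {}; // owner -> last known token count
let ops = [];
const cursor = db.collection.find({}).sort({ time: 1 });
for await (const doc of cursor) {
  for (const h of doc.holdings) carry[h.owner] = h.tokens;
  const full = Object.entries(carry).map(([owner, tokens]) => ({ owner, tokens }));
  ops.push({ updateOne: { filter: { _id: doc._id }, update: { $set: { holdings: full } } } });
  if (ops.length === 1000) { await db.collection.bulkWrite(ops); ops = []; }
}
if (ops.length) await db.collection.bulkWrite(ops);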
If you don't care about performance at all and want it in a single query, you can still use the same pipeline; instead of the separate findOne you extract the max time inside the pipeline itself, which requires adding an initial $group stage, like so:
db.collection.aggregate([
{
$group: {
_id: null,
maxTime: {
$max: "$time"
},
roots: {
$push: "$$ROOT"
}
}
},
{
$unwind: "$roots"
},
{
$replaceRoot: {
newRoot: {
"$mergeObjects": [
"$roots",
{
maxTime: "$maxTime"
}
]
}
}
},
... same pipeline ...
])