filter on partition key before iterating over array cosmos db-CodePudding

I have a CosmosDbQuery that works fine but is a bit slow and expensive:

SELECT c.actionType as actionType, count(1) as count 
FROM c in t.processList
WHERE c.processTimestamp > @from
GROUP BY c.actionType

To optimise my query i would like to first have a Where clause on my parent partitionKey e.g. parent.minute > x before iterating over the processlist. After this where there is no need for the c.processTimestamp > @from.

"id": "b6fd10cc-3a0b-4666-bf55-f22436a5f8d9",
"Name": "xxx",
"Age": 1,
"minute": 202302021026,
"processList": [
   {
      "processTimestamp": "2023-02-01T10:28:48.3004825Z",
      "actionType": "Action1",
      "oldValue": "2/1/2023 10:28:41 AM",
      "newValue": "2/1/2023 10:28:48 AM"
   },
   {
      "processTimestamp": "2023-02-01T10:28:48.3004825Z",
      "actionType": "Action2",
      "oldValue": "2/1/2023 10:28:48 AM",
      "newValue": "2/1/2023 10:28:48 AM"
   }],
}

I have tried subqueries and joins but i could not get it to work:

SELECT c.actionType as actionType, count(1) as count 
FROM (SELECT * FROM C WHERE c.minute > 9) in t.processList
WHERE c.processTimestamp > @from
GROUP BY c.actionType")

My desired result would be:

[
    {
        "actionType": "action1",
        "count": 85351
    },
    {
        "actionType": "action2",
        "count": 2354
    }
]

CodePudding user response：

A few comments here.

As noted in my comment, Group By with Sub-Queries is unsupported, documented here.

Using a Date/Time value as a partition key is typically an anti-pattern for Cosmos DB. This query may be slow and expensive because at large scales, using time as a partition key means that most queries are hitting the same partition due of data recency (newer data gets more requests than older data). This is also bad for writes as well for the same reason.

When this happens, it is typical to increase the throughput. However this often does little to help and in some cases can even make things worse. Also, because throughput is evenly distributed across all partitions, this results in wasted unused throughput on partition keys for older dates.

Two things to consider. Make your partition key a combination of two properties to increase cardinality. In an IOT scenario this would typically be deviceId_dateTime (Hierarchical Partition keys, in preview now, is a better way you can do this today). This will help with writes especially where data is always written with the current dateTime.

On the read path for queries, you might explore implementing a materialized view using Change Feed into a second container. This will move the throughput for reads off of the container used for ingestion and can result in more efficient throughput usage. However, you should measure this yourself to be sure.

If your container is small and will always stay that way, then this information below will not apply (< 10K RU/s and 50GB). However, such a design will not scale.

CodePudding user response：

Like Mark said: Groupby is not supported on a subquery. Tried to fix it with linq but groupby is not supported for linq aswell so i changed my code so it uses join insteadof looping over the array with the IN keyword:

SELECT pl.actionType as actionType, count(1) as count 
FROM c 
JOIN pl IN c.processList 
WHERE c.minute > @from
GROUP BY pl.actionType")