Home > Enterprise >  Azure Stream Analytics: Making the TUMBLINGWINDOW function start date of each window inclusive and e
Azure Stream Analytics: Making the TUMBLINGWINDOW function start date of each window inclusive and e

Time:02-23

Context

I have a question regarding azure stream analytics (which uses T-SQL like language) and in particular working with the offset parameter inside the argument of the TUMBLING WINDOW function. Basically I am trying to use the offset argument to make the start time of the window interval inclusive and the end window time exclusive (which is the opposite to the default behaviour).

Here is the reference documentation: https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics

Question

The documentation mentions this can be done with offset and gives an example but don't really understand how it works and want to be able to apply it to the scenario where I make the TUMBLING WINDOW interval 1 day (not sure if that makes a difference or not to the parameters passed into offset). I haven't managed to find any clear explanations of this so would great if anyone has any insights.

Tried

SELECT System.Timestamp() AS WindowEnd, TollId, COUNT(*)  
FROM Input TIMESTAMP BY EntryTime  
GROUP BY TollId, TumblingWindow(Duration(day, 1), Offset(millisecond, -1))

Here I have just taken the example in the docs and altered the time window duration to my desired duration while leaving the offset as is. Not sure if this is right (doesn't seem right) but not entirely sure how to use the offset to make the window inclusive on the start window and exclusive on the end like it mentions in the documentation example.

CodePudding user response:

First let's mention two good practices when writing a query with a temporal element:

  • If you're developing locally in VS Code, please use TIMESTAMP BY, or the whole file will be loaded on a single timestamp (query start time) which will make all temporal logic moot. If you don't have a event timestamp, or don't want to use one from the payload, you can always use TIMESTAMP BY EventEnqueuedUtcTime (that you will need to add to your local data sample) which is the default implicit behavior on event hub anyway
  • In your query, make the window bounds visible by selecting both the WindowStart and WindowEnd like this:
SELECT
    --This is for a 1 minute window, adjust as necessary
    DATEADD(minute,-1,System.Timestamp()) AS WindowStart,
    System.Timestamp() AS WindowEnd,
    ...

Now that we have that, we can look at the offset syntax. The best in my opinion is to look at it via an example (that I should then put in the doc...):

My input file:

{"ReadingTimestamp":"2021-12-10T10:00:00.0000000","deviceId":1}
{"ReadingTimestamp":"2021-12-10T10:00:59.0000000","deviceId":2}
{"ReadingTimestamp":"2021-12-10T10:01:00.0000000","deviceId":3}
{"ReadingTimestamp":"2021-12-10T10:01:01.0000000","deviceId":4}

I will look count these records on a tumbling window of 1 minute, and see how they fall in and out of them depending on offset.

That's the default behavior:

SELECT
    DATEADD(minute,-1,System.Timestamp()) AS WindowStart,
    System.Timestamp() AS WindowEnd,
    COUNT(*) AS C
FROM Input1 TIMESTAMP BY ReadingTimestamp
GROUP BY Tumbling(duration(minute,1))

Which gives me:

WindowStart WindowEnd C
2021-12-10T09:59:00.0000000Z 2021-12-10T10:00:00.0000000Z 1
2021-12-10T10:00:00.0000000Z 2021-12-10T10:01:00.0000000Z 2
2021-12-10T10:01:00.0000000Z 2021-12-10T10:02:00.0000000Z 1

Notice how the first record falls on 10:00:00 and gets pushed in the previous window.

Then using the offset:

SELECT
    DATEADD(minute,-1,System.Timestamp()) AS WindowStart,
    System.Timestamp() AS WindowEnd,
    COUNT(*) AS C
INTO Output3
FROM Input1 TIMESTAMP BY ReadingTimestamp
GROUP BY Tumbling(duration(minute,1),offset(millisecond,-1))

We have:

WindowStart WindowEnd C
2021-12-10T09:59:59.9990000Z 2021-12-10T10:00:59.9990000Z 2
2021-12-10T10:00:59.9990000Z 2021-12-10T10:01:59.9990000Z 2

Where records 1 and 2 are now part of the first window, and 3 and 4 the second.

But also note how we haven't made the previous windows switch the behavior at their bounds (from inclusive to exclusive and vice versa), we have simply moved the goalpost by an epsilon value to emulate the behavior.

So:

  • 09:59:00 to 10:00:00 : received no events so wasn't displayed anymore
  • 10:00:00 to 10:01:00 : became 09:59:59.999 to 10:00:59.999
  • 10:01:00 to 10:02:00 : became 10:00:59.999 to 10:01:59.999

But as long as you can find an offset that is small enough to be considered negligible by your use case you are good. We have microsecond so I'm guessing that should work out fine ;)

  • Related