So the question is: what is the best way to get the last record per key from a big external table hosted on a Synapse serverless SQL pool? The table has a key column (col_key) and also a timestamp column (col_date), for example:
col_key | col_a | col_b | col_date
a1      | val1  | val2  | 2018-03-05T18:24:40.228+00:00
a1      | val10 | val2  | 2018-03-06T18:24:40.228+00:00
a1      | val12 | val29 | 2018-03-07T18:24:40.228+00:00
a2      | val1  | val2  | 2018-03-04T18:24:40.228+00:00
a2      | val1  | val2  | 2018-03-05T18:24:40.228+00:00
a2      | val1  | val42 | 2018-03-09T18:24:40.228+00:00
a3      | val1  | val2  | 2018-03-05T18:24:40.228+00:00
a3      | val1  | val9  | 2018-03-07T18:24:40.228+00:00
a3      | val1  | val32 | 2018-03-08T18:24:40.228+00:00
The query should return a compacted table:
col_key | col_a | col_b | col_date
a1      | val12 | val29 | 2018-03-07T18:24:40.228+00:00
a2      | val1  | val42 | 2018-03-09T18:24:40.228+00:00
a3      | val1  | val32 | 2018-03-08T18:24:40.228+00:00
Spark cannot be used; what kind of optimized query with good performance can be used?
CodePudding user response:
The best approach depends on the amount of data you have and on how it is organized. A good starting point in this scenario is to rank the rows with ROW_NUMBER(), partitioning by col_key and sorting each partition by col_date descending, then keep only the first row per partition. You can do this with a query like this:
SELECT col_key, col_a, col_b, col_date
FROM (
    SELECT
        col_key, col_a, col_b, col_date,
        ROW_NUMBER() OVER (PARTITION BY col_key ORDER BY col_date DESC) AS rn
    FROM
        OPENROWSET(...your pointer to data here...) AS sortedresult
) AS finalresult
WHERE finalresult.rn = 1;
Of course, the actual performance and cost depend on how the data behind the query is stored and on your definition of "good performance". Azure Synapse Analytics gives you limitless scalability, and with its pay-as-you-go pricing you will also need a limitless wallet unless you optimize your queries for cost. So check with your real data how the query performs and what it costs; without real data it does not make sense to optimize this further.
More info related to this can be found here:
- ROW_NUMBER(): https://docs.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql
- Serverless pool best-practices: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-serverless-sql-pool
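As a sketch of what the OPENROWSET placeholder might look like, assuming the data sits as Parquet files in a Data Lake container (the storage URL and folder below are hypothetical, replace them with your own):

```sql
-- Hypothetical storage account, container and folder; adjust to your lake.
SELECT col_key, col_a, col_b, col_date
FROM (
    SELECT
        col_key, col_a, col_b, col_date,
        ROW_NUMBER() OVER (PARTITION BY col_key ORDER BY col_date DESC) AS rn
    FROM OPENROWSET(
        BULK 'https://yourstorage.dfs.core.windows.net/yourlake/yourfolder/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sortedresult
) AS finalresult
WHERE finalresult.rn = 1;
```

If the files are CSV/PSV instead of Parquet, the FORMAT and parser options of OPENROWSET change accordingly; the ranking logic stays the same.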
CodePudding user response:
If your date column is actually in DATETIMEOFFSET format, note that this data type is currently not supported for external tables in Azure Synapse serverless SQL pools, so you might receive this error:

Msg 15815, Level 16, State 1, Line 80 External data type 'DATETIMEOFFSET' is currently not supported. File: 'https://someStorage.dfs.core.windows.net/somelake/raw/serverlessTable.psv'.

It will let you create the table; the error only occurs when you query it. Date format options are also not currently supported (whereas they are for dedicated SQL pools), so I think you will have to import the date column as VARCHAR and then cast it to DATETIMEOFFSET, something like this:
;WITH cte AS (
SELECT
col_key,
col_a,
col_b,
TRY_CAST( col_date AS DATETIMEOFFSET ) col_date
FROM dbo.yourExternalTable
), cte2 AS (
SELECT *, ROW_NUMBER() OVER( PARTITION BY col_key ORDER BY col_date DESC ) rn
FROM cte
)
SELECT col_key, col_a, col_b, col_date
FROM cte2
WHERE rn = 1;