So the question is: what is the best way to get the last record per key from a big external table hosted on a Synapse serverless SQL pool? The table has a key column (col_key) and also a timestamp column (col_date), for example:
col_key | col_a | col_b | col_date
a1      | val1  | val2  | 2018-03-05T18:24:40.228+00:00
a1      | val10 | val2  | 2018-03-06T18:24:40.228+00:00
a1      | val12 | val29 | 2018-03-07T18:24:40.228+00:00
a2      | val1  | val2  | 2018-03-04T18:24:40.228+00:00
a2      | val1  | val2  | 2018-03-05T18:24:40.228+00:00
a2      | val1  | val42 | 2018-03-09T18:24:40.228+00:00
a3      | val1  | val2  | 2018-03-05T18:24:40.228+00:00
a3      | val1  | val9  | 2018-03-07T18:24:40.228+00:00
a3      | val1  | val32 | 2018-03-08T18:24:40.228+00:00
The query should return a compacted table:
col_key | col_a | col_b | col_date
a1      | val12 | val29 | 2018-03-07T18:24:40.228+00:00
a2      | val1  | val42 | 2018-03-09T18:24:40.228+00:00
a3      | val1  | val32 | 2018-03-08T18:24:40.228+00:00
Spark cannot be used; what kind of optimized query with good performance can be used?
CodePudding user response:
The best approach depends on the amount of data you have and on how it is organized. A good starting point in this scenario is to rank the rows with ROW_NUMBER(), partitioning by col_key and sorting each partition by col_date descending, then keep only the first row per partition. You can do this with a query like this:
SELECT col_key, col_a, col_b, col_date
FROM (
    SELECT
        col_key, col_a, col_b, col_date,
        ROW_NUMBER() OVER (PARTITION BY col_key ORDER BY col_date DESC) AS rn
    FROM
        OPENROWSET(...your pointer to data here...) AS sortedresult
) AS finalresult
WHERE finalresult.rn = 1;
Of course, the actual performance and cost depend on how the data behind the query is stored and on your definition of "good performance". Azure Synapse Analytics gives you limitless scalability, and with its pay-as-you-go pricing you will also need a limitless wallet unless you optimize your queries for cost. So check with your real data how the query performs and what it costs; without real data it does not make sense to optimize this further.
More info related to this can be found here:
- ROW_NUMBER(): https://docs.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql
- Serverless pool best-practices: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-serverless-sql-pool
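As a sketch of what the OPENROWSET placeholder might look like, assuming the data sits as Parquet files in a Data Lake container (the storage URL and folder below are hypothetical, replace them with your own):

```sql
-- Hypothetical storage account, container and folder; adjust to your lake.
SELECT col_key, col_a, col_b, col_date
FROM (
    SELECT
        col_key, col_a, col_b, col_date,
        ROW_NUMBER() OVER (PARTITION BY col_key ORDER BY col_date DESC) AS rn
    FROM OPENROWSET(
        BULK 'https://yourstorage.dfs.core.windows.net/yourlake/yourfolder/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sortedresult
) AS finalresult
WHERE finalresult.rn = 1;
```

If the files are CSV/PSV instead of Parquet, the FORMAT and parser options of OPENROWSET change accordingly; the ranking logic stays the same.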
CodePudding user response:
If your date column is actually in DATETIMEOFFSET format, note that this data type is currently not supported for external tables in Azure Synapse serverless SQL pools, so you might receive this error:

Msg 15815, Level 16, State 1, Line 80 External data type 'DATETIMEOFFSET' is currently not supported. File: 'https://someStorage.dfs.core.windows.net/somelake/raw/serverlessTable.psv'.

It will let you create the table; the error only occurs when you query it. Date format options are also not currently supported (whereas they are for dedicated SQL pools), so I think you will have to import the date column as VARCHAR and then cast it to DATETIMEOFFSET, something like this:
;WITH cte AS (
SELECT
col_key,
col_a,
col_b,
TRY_CAST( col_date AS DATETIMEOFFSET ) col_date
FROM dbo.yourExternalTable
), cte2 AS (
SELECT *, ROW_NUMBER() OVER( PARTITION BY col_key ORDER BY col_date DESC ) rn
FROM cte
)
SELECT col_key, col_a, col_b, col_date
FROM cte2
WHERE rn = 1;