I am trying to use the QUALIFY Clause in Snowflake since it does not support the use of DISTINCT ON as Postgres does. However, when I use the QUALIFY Clause in my query it returns different results from that in Postgres, and each time I run this snowflake for a result set of thousands of records it returns a different answer while if I run it for a few hundreds of records it provides the correct result each time. While in Postgres it returns the correct result each time regardless of the size of the result set. According to these pages:
Using `DISTINCT ON` in Snowflake
DISTINCT ON and QUALIFY should work in the same way, so I'm not exactly sure why I am having different results?
Snowflake:
SELECT wallet_transaction.walletId,
wallet_transaction.balance,
wallet_transaction.transactionDate,
wallet_transaction.parkerId,
wallet_transaction.operatorName,
wallet_transaction.operatorId,
wallet_transaction.offerName,
wallet_transaction.offerId
FROM wallet_transaction
WHERE {% condition snapshot_date %} wallet_transaction.transactionDate {% endcondition %}
QUALIFY ROW_NUMBER() OVER (PARTITION BY wallet_transaction.walletId
ORDER BY wallet_transaction.walletId, wallet_transaction.transactionDate DESC) = 1;
Postgres:
SELECT DISTINCT ON(wallet_transaction.walletId)
wallet_transaction.walletId,
wallet_transaction.balance,
wallet_transaction.transactionDate,
wallet_transaction.parkerId,
wallet_transaction.operatorName,
wallet_transaction.operatorId,
wallet_transaction.offerName,
wallet_transaction.offerId
FROM wallet_transaction
WHERE {% condition snapshot_date %} wallet_transaction.transactionDate {% endcondition %}
ORDER BY wallet_transaction.walletId,wallet_transaction.transactionDate DESC;
CodePudding user response:
ROW_NUMBER() OVER (PARTITION BY wallet_transaction.walletId ORDER BY wallet_transaction.walletId, wallet_transaction.transactionDate DESC)
does not make sense for column wallet_transaction.walletId
being present on both PARTITION BY and ORDER BY.
SELECT
wallet_transaction.walletId,
wallet_transaction.balance,
wallet_transaction.transactionDate,
wallet_transaction.parkerId,
wallet_transaction.operatorName,
wallet_transaction.operatorId,
wallet_transaction.offerName,
wallet_transaction.offerId
FROM wallet_transaction
WHERE {% condition snapshot_date %} wallet_transaction.transactionDate {% endcondition %}
QUALIFY ROW_NUMBER() OVER (PARTITION BY wallet_transaction.walletId
ORDER BY wallet_transaction.transactionDate DESC) = 1;
It means that column wallet_transaction.transactionDate
alone does not provide stable sort.
It could be confirmed by checking windowed COUNT
partitioned by all columns present on PARTITION and ORDER BY clauses:
SELECT
wallet_transaction.walletId,
wallet_transaction.balance,
wallet_transaction.transactionDate,
wallet_transaction.parkerId,
wallet_transaction.operatorName,
wallet_transaction.operatorId,
wallet_transaction.offerName,
wallet_transaction.offerId
FROM wallet_transaction
WHERE {% condition snapshot_date %} wallet_transaction.transactionDate {% endcondition %}
QUALIFY COUNT() OVER (PARTITION BY wallet_transaction.walletId,
wallet_transaction.transactionDate DESC) > 1;
CodePudding user response:
wallet_transaction.walletId
is redundant in the ORDER BY
clause of row_number()
, but that cannot explain your observation.
Also assuming your opaque WHERE
condition isn't broken somehow.
Duplicates per (walletId, transactionDate)
could explain it. Then the result is arbitrary and can change between executions without changes to the table. Postgres typically picks the first row in physical sort order from duplicates. Snowflake has a completely different storage architecture and may arrive at a different pick.
This can even happen with a UNIQUE
constraint on (walletId, transactionDate)
if transactionDate
can be NULL
, as NULL
compares distinct for the purpose of a UNIQUE
constraint (as per SQL standard). See:
If so, add more ORDER BY
expressions to make the sort order deterministic. The PK column of the table would do it - assuming transaction_id
for lack of declaration. Then you get the row with the smallest transaction_id
in case of duplicates. Works for Postgres and Snowflake alike.
Snowflake:
SELECT walletId
, balance
, transactionDate
, parkerId
, operatorName
, operatorId
, offerName
, offerId
FROM wallet_transaction w
WHERE {% condition snapshot_date %} transactionDate {% endcondition %}
QUALIFY row_number() OVER (PARTITION BY walletId
ORDER BY transactionDate DESC, transaction_id) = 1;
Postgres:
SELECT DISTINCT ON (walletId)
walletId
, balance
, transactionDate
, parkerId
, operatorName
, operatorId
, offerName
, offerId
FROM wallet_transaction w
WHERE {% condition snapshot_date %} transactionDate {% endcondition %}
ORDER BY walletId, transactionDate DESC, transaction_id;
See: