DISTINCT ON(Postgres) vs QUALIFY(Snowflake)-CodePudding

I am trying to use the QUALIFY Clause in Snowflake since it does not support the use of DISTINCT ON as Postgres does. However, when I use the QUALIFY Clause in my query it returns different results from that in Postgres, and each time I run this snowflake for a result set of thousands of records it returns a different answer while if I run it for a few hundreds of records it provides the correct result each time. While in Postgres it returns the correct result each time regardless of the size of the result set. According to these pages:

Using `DISTINCT ON` in Snowflake

https://community.snowflake.com/s/question/0D50Z00008NrIYuSAN/whats-the-equivalent-of-distinct-on-in-snowflake-

DISTINCT ON and QUALIFY should work in the same way, so I'm not exactly sure why I am having different results?

Snowflake:

SELECT wallet_transaction.walletId,
       wallet_transaction.balance,
       wallet_transaction.transactionDate,
       wallet_transaction.parkerId,
       wallet_transaction.operatorName,
       wallet_transaction.operatorId,
       wallet_transaction.offerName,
       wallet_transaction.offerId
FROM wallet_transaction
WHERE {% condition snapshot_date %} wallet_transaction.transactionDate {% endcondition %}
QUALIFY ROW_NUMBER() OVER (PARTITION BY wallet_transaction.walletId
ORDER BY wallet_transaction.walletId, wallet_transaction.transactionDate DESC) = 1;

Postgres:

SELECT DISTINCT ON(wallet_transaction.walletId)
       wallet_transaction.walletId,
       wallet_transaction.balance,
       wallet_transaction.transactionDate,
       wallet_transaction.parkerId,
       wallet_transaction.operatorName,
       wallet_transaction.operatorId,
       wallet_transaction.offerName,
       wallet_transaction.offerId
FROM wallet_transaction
WHERE {% condition snapshot_date %} wallet_transaction.transactionDate {% endcondition %}
ORDER BY wallet_transaction.walletId,wallet_transaction.transactionDate DESC;

CodePudding user response：

ROW_NUMBER() OVER (PARTITION BY wallet_transaction.walletId ORDER BY wallet_transaction.walletId, wallet_transaction.transactionDate DESC) does not make sense for column wallet_transaction.walletId being present on both PARTITION BY and ORDER BY.

SELECT
    wallet_transaction.walletId,
    wallet_transaction.balance,
    wallet_transaction.transactionDate,
    wallet_transaction.parkerId,
    wallet_transaction.operatorName,
    wallet_transaction.operatorId,
    wallet_transaction.offerName,
    wallet_transaction.offerId
FROM wallet_transaction
WHERE {% condition snapshot_date %} wallet_transaction.transactionDate {% endcondition %}
QUALIFY ROW_NUMBER() OVER (PARTITION BY wallet_transaction.walletId
       ORDER BY wallet_transaction.transactionDate DESC) = 1;

It means that column wallet_transaction.transactionDate alone does not provide stable sort.

It could be confirmed by checking windowed COUNT partitioned by all columns present on PARTITION and ORDER BY clauses:

SELECT
    wallet_transaction.walletId,
    wallet_transaction.balance,
    wallet_transaction.transactionDate,
    wallet_transaction.parkerId,
    wallet_transaction.operatorName,
    wallet_transaction.operatorId,
    wallet_transaction.offerName,
    wallet_transaction.offerId
FROM wallet_transaction
WHERE {% condition snapshot_date %} wallet_transaction.transactionDate {% endcondition %}
QUALIFY COUNT() OVER (PARTITION BY wallet_transaction.walletId,
                      wallet_transaction.transactionDate DESC) > 1;

CodePudding user response：

wallet_transaction.walletIdis redundant in the ORDER BY clause of row_number(), but that cannot explain your observation.

Also assuming your opaque WHERE condition isn't broken somehow.

Duplicates per (walletId, transactionDate) could explain it. Then the result is arbitrary and can change between executions without changes to the table. Postgres typically picks the first row in physical sort order from duplicates. Snowflake has a completely different storage architecture and may arrive at a different pick.

This can even happen with a UNIQUE constraint on (walletId, transactionDate) if transactionDate can be NULL, as NULL compares distinct for the purpose of a UNIQUE constraint (as per SQL standard). See:

Create unique constraint with null columns

If so, add more ORDER BY expressions to make the sort order deterministic. The PK column of the table would do it - assuming transaction_id for lack of declaration. Then you get the row with the smallest transaction_id in case of duplicates. Works for Postgres and Snowflake alike.

Snowflake:

SELECT walletId
     , balance
     , transactionDate
     , parkerId
     , operatorName
     , operatorId
     , offerName
     , offerId
FROM   wallet_transaction w
WHERE  {% condition snapshot_date %} transactionDate {% endcondition %}
QUALIFY row_number() OVER (PARTITION BY walletId
                           ORDER BY transactionDate DESC, transaction_id) = 1;

Postgres:

SELECT DISTINCT ON (walletId)
       walletId
     , balance
     , transactionDate
     , parkerId
     , operatorName
     , operatorId
     , offerName
     , offerId
FROM   wallet_transaction w
WHERE  {% condition snapshot_date %} transactionDate {% endcondition %}
ORDER  BY walletId, transactionDate DESC, transaction_id;

See:

Select first row in each GROUP BY group?