Home > Blockchain >  How to avoid double counting while using SQL
How to avoid double counting while using SQL

Time:05-13

I have a transaction table which contains date, transaction_id and category (it is a sales table of clothes). It looks like this:

ndate transaction_id category
11 111 M
11 111 F
11 112 M
12 113 F
12 113 O
13 114 M

I am applying a sql query to aggregate the data and make a new table to reduce the number of rows (as the data is too huge):

select ndate, category, count(distinct transaction_id)
from transaction_table
group by ndate, category;

The output that I am getting is this:

ndate category transactions
11 M 2
11 F 1
12 F 1
12 O 1
13 M 1

As you can see from the transaction_table there have been only 4 transactions but in the output_table the total count of transaction is 6 which is wrongly been done due to the group by of category.

This is the output I am trying to get, I am not sure if a table like this can be made in sql (I am new to sql) :

enter image description here

CodePudding user response:

You haven't told us what you expect, but I think from reading your question that you need to GROUP BY category, but you only want to count a transaction_id one time if it happened to exist in 2 categories.

We can't have it both ways, so we have to choose - if we list the category in GROUP BY then we have to decide how to count a transaction that happens in 2 categories.

Option 1: Choose "first" category (shown alphabetically)

WITH CTE_DEDUPE AS (
  SELECT 
    * 
  FROM 
    transaction_table QUALIFY ROW_NUMBER() OVER (
      PARTITION BY TRANSACTION_ID 
      ORDER BY 
        TRANSACTION_DT, 
        CATEGORY ASC
    ) = 1
), 
BASIC_AGGS AS (
  SELECT 
    TRANSACTION_DT, 
    CATEGORY, 
    COUNT(DISTINCT TRANSACTION_ID) as TRANSACTION_ID_COUNTDISTINCT, 
    COUNT(1) AS AGG_ROW_COUNT 
  FROM 
    CTE_DEDUPE 
  GROUP BY 
    TRANSACTION_DT, 
    CATEGORY
) 
SELECT 
  BASIC_AGGS.TRANSACTION_DT, 
  BASIC_AGGS.CATEGORY, 
  BASIC_AGGS.TRANSACTION_ID_COUNTDISTINCT 
FROM 
  BASIC_AGGS

Result:

TRANSACTION_DT CATEGORY TRANSACTION_ID_COUNTDISTINCT
11 F 1
12 F 1
13 M 1

There are other options but I will hold off until you tell us some more information about what you expect, and what specific instance of SQL you are running. I generated the SQL with output:

  • Related