How to capture the average of multiple categories?-CodePudding

I am trying to find the average number of purchases by buyer by store without surfacing buyer because there are millions.

I'm getting an error of invalid identifier trying to group by store and am not sure what I'm missing or if there's a better way to do this. The sample data looks like this, but with millions of records.

Purchase_ID	Buyer_ID	Store
abc	1a	East
abd	1a	East
abe	1b	East
abf	1c	West
abg	1c	West
abh	1d	South
abi	1e	North
abj	1f	North

And the ideal output would look like:

t.store	average_purchases_per_store
East	1.5
West	2
South	1
North	1

Sample code:

SELECT t.store,AVG(T.distinct_purchases) as average_purchases_per_store
FROM 
    (SELECT COUNT(DISTINCT(purchase_id)) AS distinct_purchases
    FROM table GROUP BY buyer) AS T GROUP BY t.store

Any help would be hugely appreciated.

CodePudding user response：

Greg's answer is almost correct, but he lost the DISTINCT thus is a ling repeats, the value is lost:

with T1(PURCHASE_ID,BUYER_ID, STORE) as (
    select * from values
        ('abc','1a','East'),
        ('abc','1a','East'),
        ('abd','1a','East'),
        ('abe','1b','East'),
        ('abf','1c','West'),
        ('abg','1c','West'),
        ('abh','1d','South'),
        ('abi','1e','North'),
        ('abj','1f','North')
), BUYER_PURCHASES as (
    select BUYER_ID
        ,STORE
        ,count(distinct PURCHASE_ID) as PURCHASES
    from T1
    group by 1,2
)
select STORE
  ,avg(PURCHASES) as average_purchases_per_store
from     BUYER_PURCHASES
group by STORE

gives:

STORE	AVERAGE_PURCHASES_PER_STORE
East	1.5
West	2
North	1
South	1

CodePudding user response：

You just need to aggregate to buyers and stores first, and from that intermediate result aggregate to store:

create or replace table T1(PURCHASE_ID string, BUYER_ID string, STORE string);

insert into T1 (PURCHASE_ID,BUYER_ID, STORE) values
('abc','1a','East'),
('abd','1a','East'),
('abe','1b','East'),
('abf','1c','West'),
('abg','1c','West'),
('abh','1d','South'),
('abi','1e','North'),
('abj','1f','North');

with BUYER_PURCHASES as
(
select       BUYER_ID
            ,STORE
            ,count(*) as PURCHASES
from T1
group by    BUYER_ID, STORE
)
select   STORE
        ,avg(PURCHASES) as average_purchases_per_store
from     BUYER_PURCHASES
group by STORE
;

Output:

STORE	AVERAGE_PURCHASES_PER_STORE
East	1.5
West	2
South	1
North	1

Note that you don't need to use the distinct keyword unless you have to filter out duplicate rows. If you do have duplicates, that should be addressed on ETL/ELT.

CodePudding user response：

Hopefully this is enough to get you started. There's literally thousands of possible approaches that depending on your datasets (you mentioned there's millions of rows) may provide you more flexibility or speed etc. High level approach would be to reduce the number of rows as quickly as possible. The first count distinct query should include as many predicates as you can to prevent any extra work. Hope this helps :-)

SELECT 
     STORE 
    ,AVG(DISTINCT_STORE_PURCHASES) AVG_PURCHASES_PER_STORE
    ,AVG(DISTINCT_BUYER_PURCHASES) AVG_BUYER_PURCHASES_PER_STORE
FROM 
    (SELECT 
        STORE
      , COUNT(DISTINCT PURCHASE_ID) OVER (PARTITION BY BUYER_ID) DISTINCT_BUYER_PURCHASES 
      , DIV0(COUNT(DISTINCT PURCHASE_ID) OVER (PARTITION BY STORE), COUNT(DISTINCT BUYER_ID) OVER (PARTITION BY STORE) ) DISTINCT_STORE_PURCHASES  
      FROM CTE) 
GROUP BY 
      STORE  ;