Home > Software engineering >  How to capture the average of multiple categories?
How to capture the average of multiple categories?

Time:08-04

I am trying to find the average number of purchases by buyer by store without surfacing buyer because there are millions.

I'm getting an error of invalid identifier trying to group by store and am not sure what I'm missing or if there's a better way to do this. The sample data looks like this, but with millions of records.

Purchase_ID Buyer_ID Store
abc 1a East
abd 1a East
abe 1b East
abf 1c West
abg 1c West
abh 1d South
abi 1e North
abj 1f North

And the ideal output would look like:

t.store average_purchases_per_store
East 1.5
West 2
South 1
North 1

Sample code:

SELECT t.store,AVG(T.distinct_purchases) as average_purchases_per_store
FROM 
    (SELECT COUNT(DISTINCT(purchase_id)) AS distinct_purchases
    FROM table GROUP BY buyer) AS T GROUP BY t.store

Any help would be hugely appreciated.

CodePudding user response:

Greg's answer is almost correct, but he lost the DISTINCT thus is a ling repeats, the value is lost:

with T1(PURCHASE_ID,BUYER_ID, STORE) as (
    select * from values
        ('abc','1a','East'),
        ('abc','1a','East'),
        ('abd','1a','East'),
        ('abe','1b','East'),
        ('abf','1c','West'),
        ('abg','1c','West'),
        ('abh','1d','South'),
        ('abi','1e','North'),
        ('abj','1f','North')
), BUYER_PURCHASES as (
    select BUYER_ID
        ,STORE
        ,count(distinct PURCHASE_ID) as PURCHASES
    from T1
    group by 1,2
)
select STORE
  ,avg(PURCHASES) as average_purchases_per_store
from     BUYER_PURCHASES
group by STORE

gives:

STORE AVERAGE_PURCHASES_PER_STORE
East 1.5
West 2
North 1
South 1

CodePudding user response:

You just need to aggregate to buyers and stores first, and from that intermediate result aggregate to store:

create or replace table T1(PURCHASE_ID string, BUYER_ID string, STORE string);

insert into T1 (PURCHASE_ID,BUYER_ID, STORE) values
('abc','1a','East'),
('abd','1a','East'),
('abe','1b','East'),
('abf','1c','West'),
('abg','1c','West'),
('abh','1d','South'),
('abi','1e','North'),
('abj','1f','North');

with BUYER_PURCHASES as
(
select       BUYER_ID
            ,STORE
            ,count(*) as PURCHASES
from T1
group by    BUYER_ID, STORE
)
select   STORE
        ,avg(PURCHASES) as average_purchases_per_store
from     BUYER_PURCHASES
group by STORE
;

Output:

STORE AVERAGE_PURCHASES_PER_STORE
East 1.5
West 2
South 1
North 1

Note that you don't need to use the distinct keyword unless you have to filter out duplicate rows. If you do have duplicates, that should be addressed on ETL/ELT.

CodePudding user response:

Hopefully this is enough to get you started. There's literally thousands of possible approaches that depending on your datasets (you mentioned there's millions of rows) may provide you more flexibility or speed etc. High level approach would be to reduce the number of rows as quickly as possible. The first count distinct query should include as many predicates as you can to prevent any extra work. Hope this helps :-)

enter image description here

SELECT 
     STORE 
    ,AVG(DISTINCT_STORE_PURCHASES) AVG_PURCHASES_PER_STORE
    ,AVG(DISTINCT_BUYER_PURCHASES) AVG_BUYER_PURCHASES_PER_STORE
FROM 
    (SELECT 
        STORE
      , COUNT(DISTINCT PURCHASE_ID) OVER (PARTITION BY BUYER_ID) DISTINCT_BUYER_PURCHASES 
      , DIV0(COUNT(DISTINCT PURCHASE_ID) OVER (PARTITION BY STORE), COUNT(DISTINCT BUYER_ID) OVER (PARTITION BY STORE) ) DISTINCT_STORE_PURCHASES  
      FROM CTE) 
GROUP BY 
      STORE  ;
  • Related