I am trying to find the average number of purchases by buyer by store without surfacing buyer because there are millions.
I'm getting an error of invalid identifier trying to group by store and am not sure what I'm missing or if there's a better way to do this. The sample data looks like this, but with millions of records.
Purchase_ID | Buyer_ID | Store |
---|---|---|
abc | 1a | East |
abd | 1a | East |
abe | 1b | East |
abf | 1c | West |
abg | 1c | West |
abh | 1d | South |
abi | 1e | North |
abj | 1f | North |
And the ideal output would look like:
t.store | average_purchases_per_store |
---|---|
East | 1.5 |
West | 2 |
South | 1 |
North | 1 |
Sample code:
SELECT t.store,AVG(T.distinct_purchases) as average_purchases_per_store
FROM
(SELECT COUNT(DISTINCT(purchase_id)) AS distinct_purchases
FROM table GROUP BY buyer) AS T GROUP BY t.store
Any help would be hugely appreciated.
CodePudding user response:
Greg's answer is almost correct, but he lost the DISTINCT thus is a ling repeats, the value is lost:
with T1(PURCHASE_ID,BUYER_ID, STORE) as (
select * from values
('abc','1a','East'),
('abc','1a','East'),
('abd','1a','East'),
('abe','1b','East'),
('abf','1c','West'),
('abg','1c','West'),
('abh','1d','South'),
('abi','1e','North'),
('abj','1f','North')
), BUYER_PURCHASES as (
select BUYER_ID
,STORE
,count(distinct PURCHASE_ID) as PURCHASES
from T1
group by 1,2
)
select STORE
,avg(PURCHASES) as average_purchases_per_store
from BUYER_PURCHASES
group by STORE
gives:
STORE | AVERAGE_PURCHASES_PER_STORE |
---|---|
East | 1.5 |
West | 2 |
North | 1 |
South | 1 |
CodePudding user response:
You just need to aggregate to buyers and stores first, and from that intermediate result aggregate to store:
create or replace table T1(PURCHASE_ID string, BUYER_ID string, STORE string);
insert into T1 (PURCHASE_ID,BUYER_ID, STORE) values
('abc','1a','East'),
('abd','1a','East'),
('abe','1b','East'),
('abf','1c','West'),
('abg','1c','West'),
('abh','1d','South'),
('abi','1e','North'),
('abj','1f','North');
with BUYER_PURCHASES as
(
select BUYER_ID
,STORE
,count(*) as PURCHASES
from T1
group by BUYER_ID, STORE
)
select STORE
,avg(PURCHASES) as average_purchases_per_store
from BUYER_PURCHASES
group by STORE
;
Output:
STORE | AVERAGE_PURCHASES_PER_STORE |
---|---|
East | 1.5 |
West | 2 |
South | 1 |
North | 1 |
Note that you don't need to use the distinct
keyword unless you have to filter out duplicate rows. If you do have duplicates, that should be addressed on ETL/ELT.
CodePudding user response:
Hopefully this is enough to get you started. There's literally thousands of possible approaches that depending on your datasets (you mentioned there's millions of rows) may provide you more flexibility or speed etc. High level approach would be to reduce the number of rows as quickly as possible. The first count distinct query should include as many predicates as you can to prevent any extra work. Hope this helps :-)
SELECT
STORE
,AVG(DISTINCT_STORE_PURCHASES) AVG_PURCHASES_PER_STORE
,AVG(DISTINCT_BUYER_PURCHASES) AVG_BUYER_PURCHASES_PER_STORE
FROM
(SELECT
STORE
, COUNT(DISTINCT PURCHASE_ID) OVER (PARTITION BY BUYER_ID) DISTINCT_BUYER_PURCHASES
, DIV0(COUNT(DISTINCT PURCHASE_ID) OVER (PARTITION BY STORE), COUNT(DISTINCT BUYER_ID) OVER (PARTITION BY STORE) ) DISTINCT_STORE_PURCHASES
FROM CTE)
GROUP BY
STORE ;