I was asked to pull information about three different types of clients in the last year (visited once, visited <10 times, and visited over 10 times) see if the likelihood of them returning compared to a few different factors.
For this reason, I created a pretty broad query. Currently I have a joined query of three tables: client information, visit information, and staff information. I created a calculated column in my select statement:
COUNT(DISTINCT visitno) OVER(PARTITION BY clientid) as totalvisits
Now I just need to group by totalvisits and filter by date they visited.
I tried:
where visitdate> 01/01/2021
group by totalvisits
having total visits<10
But I get an error that the visitno is not a valid group by expression.
What might I be doing wrong?
CodePudding user response:
In snowflake, you can use the QUALIFY clause to filter window functions post window aggregation.
So, the query would look like this:
SELECT
clientid,
COUNT(DISTINCT visitno) OVER(PARTITION BY clientid) as totalvisits
FROM <your_table>
WHERE visitdate >= 2021-01-01::date
AND visitdate < 2022-01-01::date
QUALIFY totalvisits < 10;
*Make sure that visitdate
has a date type beforehand, though!
CodePudding user response:
Ok, so lets make some fake data, and do the count thing:
WITH fake_data(client_id, visit_date) as (
SELECT * FROM VALUES
-- this person has visted once
(1, '2022-04-14'::date),
-- this person has visited 3 timw in the year
(3, '2022-04-13'::date),
(3, '2022-03-13'::date),
(3, '2022-02-13'::date),
-- this person is a huge vistor, but 1 is outside the with in last year.
(5, '2022-04-12'::date),
(5, '2022-03-12'::date),
(5, '2022-02-12'::date),
(5, '2022-01-12'::date),
(5, '2020-02-12'::date)
)
SELECT *,
count(distinct visit_date) over (partition by client_id) as total_visits
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)
boom:
CLIENT_ID | VISIT_DATE | TOTAL_VISITS |
---|---|---|
1 | 2022-04-14 | 1 |
3 | 2022-04-13 | 3 |
3 | 2022-03-13 | 3 |
3 | 2022-02-13 | 3 |
5 | 2022-04-12 | 4 |
5 | 2022-03-12 | 4 |
5 | 2022-02-12 | 4 |
Now to make those into those thee group/categories.
SELECT *,
count(distinct visit_date) over (partition by client_id) as total_visits,
case
when total_visits = 1 then 1
when total_visits <= 3 then 2
when total_visits > 3 then 3
end as group_id
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)
Now some math, of which I will wrap that into a sub-select (but also push a couple things down into it)
WITH fake_data(client_id, visit_date) as (
SELECT * FROM VALUES
-- this person has visted once
(1, '2022-04-14'::date),
-- this person has visited 3 timw in the year
(3, '2022-04-13'::date),
(3, '2022-04-11'::date),
(3, '2022-04-09'::date),
-- this person is a huge vistor, but 1 is outside the with in last year.
(5, '2022-04-12'::date),
(5, '2022-03-12'::date),
(5, '2022-02-12'::date),
(5, '2022-01-12'::date),
(5, '2020-02-12'::date)
)
SELECT group_id
,count(distinct client_id) as count_of_group_members
,sum(total_visits) as sum_of_group_visit
,avg(visit_gap_in_days) as avg_group_day_diff
,stddev(visit_gap_in_days) as stddev_group_day_diff
FROM (
SELECT *,
count(distinct visit_date) over (partition by client_id) as total_visits,
case
when total_visits = 1 then 1
when total_visits <= 3 then 2
when total_visits > 3 then 3
end as group_id,
lag(visit_date) over (partition by client_id order by visit_date) as prior_visit_date,
datediff('day', prior_visit_date, visit_date) as visit_gap_in_days
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)
)
GROUP BY 1
ORDER BY 1
GROUP_ID | COUNT_OF_GROUP_MEMBERS | SUM_OF_GROUP_VISIT | AVG_GROUP_DAY_DIFF | STDDEV_GROUP_DAY_DIFF |
---|---|---|---|---|
1 | 1 | 1 | ||
2 | 1 | 9 | 2 | 0 |
3 | 1 | 16 | 30 | 1.732050808 |
Wozers, that sum of visits is wrong, I have summed my sums..
So here given the count(distinct visitno)
I cannot sum that, as it becomes the sum of sums, AND I cannot do a count(*) because we have just noticed there are duplicates (otherwise the distinct is not needed). And I assume you have not stripped the rows down, as there is some "other details that you will want"
But anyways. This is the great things about SQL, you can answer anything, but you have to know the Question, and know the Data so you can know which assumptions can be held true for your data.