Home > database >  Filtering a Query based on a Date and Window function in Snowflake
Filtering a Query based on a Date and Window function in Snowflake

Time:04-14

I was asked to pull information about three different types of clients in the last year (visited once, visited <10 times, and visited over 10 times) see if the likelihood of them returning compared to a few different factors.

For this reason, I created a pretty broad query. Currently I have a joined query of three tables: client information, visit information, and staff information. I created a calculated column in my select statement:

COUNT(DISTINCT visitno) OVER(PARTITION BY clientid) as totalvisits

Now I just need to group by totalvisits and filter by date they visited.

I tried:

where visitdate> 01/01/2021
group by totalvisits
having total visits<10

But I get an error that the visitno is not a valid group by expression.

What might I be doing wrong?

CodePudding user response:

In snowflake, you can use the QUALIFY clause to filter window functions post window aggregation.

So, the query would look like this:

SELECT
  clientid,
  COUNT(DISTINCT visitno) OVER(PARTITION BY clientid) as totalvisits
FROM <your_table>
WHERE visitdate >= 2021-01-01::date
  AND visitdate < 2022-01-01::date
QUALIFY totalvisits < 10;

*Make sure that visitdate has a date type beforehand, though!

CodePudding user response:

Ok, so lets make some fake data, and do the count thing:

WITH fake_data(client_id, visit_date) as (
    SELECT * FROM VALUES
    -- this person has visted once
    (1, '2022-04-14'::date),
    -- this person has visited 3 timw in the year
    (3, '2022-04-13'::date),
    (3, '2022-03-13'::date),
    (3, '2022-02-13'::date),
    -- this person is a huge vistor, but 1 is outside the with in last year.
    (5, '2022-04-12'::date),
    (5, '2022-03-12'::date),
    (5, '2022-02-12'::date),
    (5, '2022-01-12'::date),
    (5, '2020-02-12'::date)
)
SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)

boom:

CLIENT_ID VISIT_DATE TOTAL_VISITS
1 2022-04-14 1
3 2022-04-13 3
3 2022-03-13 3
3 2022-02-13 3
5 2022-04-12 4
5 2022-03-12 4
5 2022-02-12 4

Now to make those into those thee group/categories.

SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits,
    case 
        when total_visits = 1 then 1
        when total_visits <= 3 then 2
        when total_visits > 3 then 3
    end as group_id
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)

Now some math, of which I will wrap that into a sub-select (but also push a couple things down into it)

WITH fake_data(client_id, visit_date) as (
    SELECT * FROM VALUES
    -- this person has visted once
    (1, '2022-04-14'::date),
    -- this person has visited 3 timw in the year
    (3, '2022-04-13'::date),
    (3, '2022-04-11'::date),
    (3, '2022-04-09'::date),
    -- this person is a huge vistor, but 1 is outside the with in last year.
    (5, '2022-04-12'::date),
    (5, '2022-03-12'::date),
    (5, '2022-02-12'::date),
    (5, '2022-01-12'::date),
    (5, '2020-02-12'::date)
)
SELECT group_id
    ,count(distinct client_id) as count_of_group_members
    ,sum(total_visits) as sum_of_group_visit
    ,avg(visit_gap_in_days) as avg_group_day_diff
    ,stddev(visit_gap_in_days) as stddev_group_day_diff
FROM (
SELECT *,
    count(distinct visit_date) over (partition by client_id) as total_visits,
    case 
        when total_visits = 1 then 1
        when total_visits <= 3 then 2
        when total_visits > 3 then 3
    end as group_id,
    lag(visit_date) over (partition by client_id order by visit_date) as prior_visit_date,
    datediff('day', prior_visit_date, visit_date) as visit_gap_in_days
FROM fake_data
WHERE visit_date >= dateadd('year', -1, '2022-04-14' /* CURRENT_DATE */)
)
GROUP BY 1
ORDER BY 1
GROUP_ID COUNT_OF_GROUP_MEMBERS SUM_OF_GROUP_VISIT AVG_GROUP_DAY_DIFF STDDEV_GROUP_DAY_DIFF
1 1 1
2 1 9 2 0
3 1 16 30 1.732050808

Wozers, that sum of visits is wrong, I have summed my sums..

So here given the count(distinct visitno) I cannot sum that, as it becomes the sum of sums, AND I cannot do a count(*) because we have just noticed there are duplicates (otherwise the distinct is not needed). And I assume you have not stripped the rows down, as there is some "other details that you will want"

But anyways. This is the great things about SQL, you can answer anything, but you have to know the Question, and know the Data so you can know which assumptions can be held true for your data.

  • Related