I have a table with emails and dates. I want to compare 2022's emails against 2021's emails. I have the code below, which is pretty straightforward. But now I want to add 2022's emails to the "bucket" as time passes: Jan-2022 emails will be compared to 2021's emails, Feb-2022 emails will be compared to 2021's emails plus Jan-2022's, and so on.
Any helpful advice would be greatly appreciated.
SELECT T1.date
,IFF(T1.email = T2.email,'TRUE','FALSE') "Logic"
,SUM(CASE WHEN "Logic" = 'FALSE' THEN 1 ELSE 0 END) "New email"
,SUM(CASE WHEN "Logic" = 'TRUE' THEN 1 ELSE 0 END) "Repeated email"
FROM T1
LEFT JOIN
(SELECT DISTINCT email
,date
FROM T1
WHERE
AND "date" >= '2021-01-01'
AND "date" <= '2021-12-31') T2
ON T1.email = T2.email
WHERE T1.date >= '2022-01-01'
AND T1.date <= '2022-12-31'
GROUP BY 1,2);
CodePudding user response:
If I understand you correctly, you want to count the number of emails per day, split into those that were new and those that you had already seen on a prior day.
If this is right, I would take a completely different approach and use count as a window (analytic) function for each email:
select
email, date,
count (*) over (partition by email order by date) as email_count
from t1
For each row, this tells you how many times that email has occurred up to and including that date, so a count of 1 means it's the first occurrence and anything higher means it's a repeat.
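As a rough illustration of what the running count returns, here is a small sketch that runs the same window expression through Python's built-in sqlite3 module. The sample rows are made up, and SQLite is only a stand-in for whatever database you are using, so treat the details as assumptions. It also shows one edge case worth checking: when the same email appears twice on the same date, the default window frame counts both rows together, while an explicit ROWS frame numbers them one at a time.

```python
import sqlite3

# Hypothetical sample rows; table and column names follow the query above.
rows = [
    ("a@x.com", "2021-03-01"),
    ("a@x.com", "2022-01-05"),
    ("b@x.com", "2022-01-05"),
    ("b@x.com", "2022-01-05"),  # same email twice on one day
]

con = sqlite3.connect(":memory:")  # window functions need SQLite 3.25 or later
con.execute("CREATE TABLE t1 (email TEXT, date TEXT)")
con.executemany("INSERT INTO t1 VALUES (?, ?)", rows)

running = con.execute(
    """
    SELECT email, date,
           -- default frame: rows sharing the same date are counted together
           COUNT(*) OVER (PARTITION BY email ORDER BY date) AS range_count,
           -- explicit ROWS frame: rows are counted one at a time
           COUNT(*) OVER (PARTITION BY email ORDER BY date
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_count
    FROM t1
    ORDER BY email, date, rows_count
    """
).fetchall()

for row in running:
    print(row)
```

With the default frame, b@x.com never gets a count of 1 here, so it would never be reported as new. If duplicates within a day are possible in your data, the explicit ROWS frame (or de-duplicating per day first) may be closer to what you want.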
From there, you can just do a normal grouping:
with foo as (
select
email, date,
count (*) over (partition by email order by date) as email_count
from t1
)
select
date,
sum (case when email_count > 1 then 1 else 0 end) as repeat_count,
sum (case when email_count = 1 then 1 else 0 end) as new_count
from foo
group by date
order by date
You can apply a date filter in the final query, as long as the CTE itself always reads all of the data.
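To see why the position of the date filter matters, the sketch below runs the query twice with Python's sqlite3 module: once filtering in the final query and once filtering inside the CTE. The sample rows and the 2022-01-01 cutoff are assumptions chosen for illustration, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical sample rows: a@x.com was first seen in 2021, b@x.com in 2022.
rows = [
    ("a@x.com", "2021-06-10"),
    ("a@x.com", "2022-01-05"),
    ("b@x.com", "2022-01-05"),
    ("b@x.com", "2022-02-01"),
]

con = sqlite3.connect(":memory:")  # window functions need SQLite 3.25 or later
con.execute("CREATE TABLE t1 (email TEXT, date TEXT)")
con.executemany("INSERT INTO t1 VALUES (?, ?)", rows)

TEMPLATE = """
WITH foo AS (
    SELECT email, date,
           COUNT(*) OVER (PARTITION BY email ORDER BY date) AS email_count
    FROM t1
    {cte_filter}
)
SELECT date,
       SUM(CASE WHEN email_count > 1 THEN 1 ELSE 0 END) AS repeat_count,
       SUM(CASE WHEN email_count = 1 THEN 1 ELSE 0 END) AS new_count
FROM foo
{final_filter}
GROUP BY date
ORDER BY date
"""
DATE_FILTER = "WHERE date >= '2022-01-01'"

# Filter in the final query: the CTE still sees the 2021 history.
filter_outside = con.execute(
    TEMPLATE.format(cte_filter="", final_filter=DATE_FILTER)
).fetchall()

# Filter inside the CTE: the 2021 history is gone before the count is computed.
filter_inside = con.execute(
    TEMPLATE.format(cte_filter=DATE_FILTER, final_filter="")
).fetchall()

print(filter_outside)
print(filter_inside)
```

With the filter inside the CTE, a@x.com is wrongly counted as new on 2022-01-05 because its 2021 row is no longer visible to the window function.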
CodePudding user response:
It's not clear what you're working with, as the SQL statement is incomplete/invalid (mismatched parentheses), and the data granularity is also unclear: is it daily or monthly data?
But I think the problem you are having is that you're only determining "new" emails based on whether you saw the email address last year, whereas you need to determine it based on whether you've seen the address at all, e.g. yesterday or any day prior.
To do this you'll want to:
- avoid filtering the T2 "list of emails we've seen" subquery by date, and
- use a conditional date clause when joining T2 back to T1.
Like this:
SELECT T1.date
,IFF(T1.email = T2.email,'TRUE','FALSE') "Logic"
,SUM(CASE WHEN "Logic" = 'FALSE' THEN 1 ELSE 0 END) "New email"
,SUM(CASE WHEN "Logic" = 'TRUE' THEN 1 ELSE 0 END) "Repeated email"
FROM T1
LEFT JOIN
(SELECT DISTINCT email
,date
FROM T1
-- remove the date filter from the subquery:
-- WHERE "date" >= '2021-01-01'
-- AND "date" <= '2021-12-31'
) T2
ON T1.email = T2.email
-- add this to the join condition (in the WHERE clause it would discard unmatched, i.e. new, rows):
AND T1.date > T2.date
WHERE T1.date >= '2022-01-01'
AND T1.date <= '2022-12-31'
GROUP BY 1,2;
The key here is that the join condition doesn't need to be an equality join; it can be any expression or function that evaluates to a boolean TRUE/FALSE. So for every email and date, you can compare it to every matching email on a previous (historical) date.
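One thing to watch with a join against every earlier row is that a single T1 row can match several historical rows, which inflates the repeat count. The sketch below, run through Python's sqlite3 module on made-up sample rows, compares that join with a variant that joins only against each email's first-seen date. The first-seen variant and the `seen_date` column name are my own assumptions, not part of the query above, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical sample rows: a@x.com was seen twice in 2021, b@x.com is new in 2022.
rows = [
    ("a@x.com", "2021-06-10"),
    ("a@x.com", "2021-09-01"),
    ("a@x.com", "2022-01-05"),
    ("b@x.com", "2022-01-05"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (email TEXT, date TEXT)")
con.executemany("INSERT INTO t1 VALUES (?, ?)", rows)

TEMPLATE = """
SELECT t1.date,
       SUM(CASE WHEN t2.email IS NULL THEN 1 ELSE 0 END) AS new_email,
       SUM(CASE WHEN t2.email IS NOT NULL THEN 1 ELSE 0 END) AS repeated_email
FROM t1
LEFT JOIN ({seen}) t2
       ON t1.email = t2.email
      AND t1.date > t2.seen_date   -- non-equality predicate in the ON clause
WHERE t1.date >= '2022-01-01' AND t1.date <= '2022-12-31'
GROUP BY t1.date
ORDER BY t1.date
"""

# Every earlier (email, date) pair: one 2022 row can match many history rows.
EVERY_EARLIER_ROW = "SELECT DISTINCT email, date AS seen_date FROM t1"
# Only the first time each email was seen: at most one match per 2022 row.
FIRST_SEEN_ONLY = "SELECT email, MIN(date) AS seen_date FROM t1 GROUP BY email"

fan_out = con.execute(TEMPLATE.format(seen=EVERY_EARLIER_ROW)).fetchall()
first_seen = con.execute(TEMPLATE.format(seen=FIRST_SEEN_ONLY)).fetchall()

print(fan_out)
print(first_seen)
```

In this sample there is only one repeated email on 2022-01-05, but the first query reports two because a@x.com matches both of its 2021 rows.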