So for the following schema:
CREATE TABLE activity (
id integer NOT NULL,
start_date date NOT NULL
);
CREATE TABLE account (
id integer NOT NULL,
name varchar NOT NULL
);
CREATE TABLE contact (
id integer NOT NULL,
account_id integer NOT NULL,
name varchar NOT NULL
);
CREATE TABLE activity_contact (
id integer NOT NULL,
contact_id integer NOT NULL,
activity_id integer NOT NULL
);
insert into activity(id, start_date)
values
(1, '2021-11-03'),
(2, '2021-10-03'),
(3, '2021-11-02');
insert into account(id, name)
values
(1, 'Test Account');
insert into contact(id, account_id, name)
values
(1, 1, 'John'),
(2, 1, 'Kevin');
insert into activity_contact(id, contact_id, activity_id)
values
(1, 1, 1),
(2, 2, 1),
(3, 2, 2),
(4, 1, 3);
You can see that there are 3 activities and each contact has two. What i am searching for is the number of activities per account in the previous two months. So I have the following query
SELECT contact.account_id AS accountid,
count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '1 mon'::interval)) AS last_month,
count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '2 mons'::interval)) AS prev_month
FROM activity
JOIN activity_contact ON activity_contact.activity_id = activity.id
JOIN contact ON contact.id = activity_contact.contact_id
JOIN account ON contact.account_id = account.id
GROUP BY contact.account_id;
This returns:
accountid last_month prev_month
1 3 1
However this is incorrect. There are only 3 activities, its just that each contact sees activity 1. so it is counting that activity twice. Is there a way for me to only count each activity id one time so there is no duplication?
CodePudding user response:
count(DISTINCT activity_id)
to fold duplicates in the count, like Edouard suggested.
But there is more:
SELECT con.account_id AS accountid
, count(DISTINCT activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '1 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP)) AS last_month
, count(DISTINCT activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP - interval '1 mon')) AS prev_month
FROM activity act
JOIN activity_contact aco ON aco.activity_id = act.id
JOIN contact con ON con.id = aco.contact_id
-- JOIN account acc ON con.account_id = acc.id -- noise
WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP)
GROUP BY con.account_id;
db<>fiddle here
Most importantly, add an outer
WHERE
clause to the query to filter irrelevant rows early. This can make a big difference for a small selection from a big table.Make that filter "sargable", so it can use an index on
(start_date)
(unlike your original formulation). Again, big impact for a small selection from a big table.Use the same expressions for your aggregate filter clauses. Lesser effect, but take it.
Assuming referential integrity (enforced with a FK constraint), the join to table
account
is just expensive noise. Drop it.CURRENT_DATE
is not wrong. But since your expressions yieldtimestamp
anyway, it's bit more efficient to useLOCALTIMESTAMP
to begin with.
Compare with your original to see that this is quite a bit faster.
And I assume you are aware that this query introduces a dependency on the TimeZone
setting of the executing session. The current date depends on where in the world we ask. See:
If you are not bound to this particular output format, a pivoted form is simpler, now that we filter rows early:
SELECT con.account_id AS accountid
, date_trunc('month', act.start_date) AS mon
, count(DISTINCT activity_id) AS dist_count
FROM activity act
JOIN activity_contact aco ON aco.activity_id = act.id
JOIN contact con ON con.id = aco.contact_id
-- JOIN account acc ON con.account_id = acc.id -- noise
WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP)
GROUP BY 1, 2
ORDER BY 1, 2 DESC;
db<>fiddle here