Ensuring no dupe ids in query return-CodePudding

So for the following schema:

CREATE TABLE activity (
    id integer NOT NULL,
    start_date date NOT NULL
);

CREATE TABLE account (
    id integer NOT NULL,
    name varchar NOT NULL
);

CREATE TABLE contact (
    id integer NOT NULL,
    account_id integer NOT NULL,
    name varchar NOT NULL
);

CREATE TABLE activity_contact (
    id integer NOT NULL,
    contact_id integer NOT NULL,
    activity_id integer NOT NULL
);

insert into activity(id, start_date)
values
(1, '2021-11-03'),
(2, '2021-10-03'),
(3, '2021-11-02');

insert into account(id, name)
values
(1, 'Test Account');

insert into contact(id, account_id, name)
values
(1, 1, 'John'),
(2, 1, 'Kevin');

insert into activity_contact(id, contact_id, activity_id)
values
(1, 1, 1),
(2, 2, 1),
(3, 2, 2),
(4, 1, 3);

You can see that there are 3 activities and each contact has two. What i am searching for is the number of activities per account in the previous two months. So I have the following query

SELECT contact.account_id AS accountid,
    count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '1 mon'::interval)) AS last_month,
    count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '2 mons'::interval)) AS prev_month
FROM activity
JOIN activity_contact ON activity_contact.activity_id = activity.id
JOIN contact ON contact.id = activity_contact.contact_id
JOIN account ON contact.account_id = account.id
GROUP BY contact.account_id;

This returns:

accountid   last_month  prev_month
    1           3           1

However this is incorrect. There are only 3 activities, its just that each contact sees activity 1. so it is counting that activity twice. Is there a way for me to only count each activity id one time so there is no duplication?

CodePudding user response：

count(DISTINCT activity_id) to fold duplicates in the count, like Edouard suggested.
But there is more:

SELECT con.account_id AS accountid
     , count(DISTINCT activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '1 mon')
                                           AND   act.start_date <  date_trunc('month', LOCALTIMESTAMP)) AS last_month
     , count(DISTINCT activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                                           AND   act.start_date <  date_trunc('month', LOCALTIMESTAMP - interval '1 mon')) AS prev_month
FROM   activity         act
JOIN   activity_contact aco ON aco.activity_id = act.id
JOIN   contact          con ON con.id = aco.contact_id
-- JOIN   account          acc ON con.account_id = acc.id  -- noise
WHERE  act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
AND    act.start_date <  date_trunc('month', LOCALTIMESTAMP)
GROUP  BY con.account_id;

db<>fiddle here

Most importantly, add an outer WHERE clause to the query to filter irrelevant rows early. This can make a big difference for a small selection from a big table.
Make that filter "sargable", so it can use an index on (start_date) (unlike your original formulation). Again, big impact for a small selection from a big table.
Use the same expressions for your aggregate filter clauses. Lesser effect, but take it.
Assuming referential integrity (enforced with a FK constraint), the join to table account is just expensive noise. Drop it. CURRENT_DATE is not wrong. But since your expressions yield timestamp anyway, it's bit more efficient to use LOCALTIMESTAMP to begin with.

Compare with your original to see that this is quite a bit faster.

And I assume you are aware that this query introduces a dependency on the TimeZone setting of the executing session. The current date depends on where in the world we ask. See:

Ignoring time zones altogether in Rails and PostgreSQL

If you are not bound to this particular output format, a pivoted form is simpler, now that we filter rows early:

SELECT con.account_id AS accountid
     , date_trunc('month', act.start_date) AS mon
     , count(DISTINCT activity_id) AS dist_count
FROM   activity         act
JOIN   activity_contact aco ON aco.activity_id = act.id
JOIN   contact          con ON con.id = aco.contact_id
-- JOIN   account          acc ON con.account_id = acc.id  -- noise
WHERE  act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
AND    act.start_date <  date_trunc('month', LOCALTIMESTAMP)
GROUP  BY 1, 2
ORDER  BY 1, 2 DESC;

db<>fiddle here