Enforce 2 unique constraints on upserts in Postgres-CodePudding

I have a table where I save contacts data

                                                Table "public.person"
            Column             |           Type           | Collation | Nullable |              Default               
------------------------------- -------------------------- ----------- ---------- ------------------------------------
 id                            | integer                  |           | not null | nextval('person_id_seq'::regclass)
 full_name                     | character varying        |           |          | 
 role                          | character varying        |           |          | 
 first_name                    | character varying        |           |          | 
 last_name                     | character varying        |           |          | 
 linkedin_slug                 | character varying        |           |          | 
 email                         | character varying        |           |          | 
 domain                        | character varying        |           |          | 
 created_at                    | timestamp with time zone |           |          | now()
 updated_at                    | timestamp with time zone |           |          | now()
Indexes:
    "pk_person" PRIMARY KEY, btree (id)
    "ix_person_domain" btree (domain)
    "ix_person_email" btree (email)
    "ix_person_updated_at" btree (updated_at)
    "uq_person_full_name_domain" UNIQUE CONSTRAINT, btree (full_name, domain)

I add data to this table from several sources. Some sources have Linkedin profiles data about people, other sources have email data. Sometimes the full names are not equal, even if they refer to the same person.

And I want to do upserts to not have duplicated data. For now I'm using the constraint on full_name, domain. I know it's an oversimplification as there may be 2 different people with the same full name in the same company, but that's not a problem at this moment.

The problem comes when a person has different full names in the different data sources I use, but the same Linkedin profile, so I know it's the same person.

Or when they're associated to 2 domains from the same company.

In those cases, I end up with duplicated rows for some people. For example:

full_name	domain	linkedin_slug
Raffi SARKISSIAN	getlago.com	sarkissianraffi
Raffi Sarkissian	getlago.com	sarkissianraffi

That one is a trivial one that could be solved making the constraint on lower(full_name), domain, but there are cases where the last name is not the same (people have more than 1 last name in many countries and they may not use them all sometimes).

Another example

full_name	domain	linkedin_slug
Amir Manji	tenjin.com	amirmanji
Amir Manji	tenjin.io	amirmanji

Ideally I'd like to be able to enforce more than 1 constraint at the same time in Postgres, but I've seen it's not easy or out-of-the-box. I don't/can't create a unique constraint on (full_name, domain, linkedin_slug). And the solution from the accepted answer is not so good for my use case because I have way more cols than in that example and I'd have to write a different upsert function for each data source (not all of them have the same attributes)

What I'm thinking is making a script to deduplicate the info 'manually' after inserting new data, but I'm not sure if there are better ways to address this.

How would you go about it?

CodePudding user response：

We cannot create unique constraints on functions of columns but we can create unique constraints on virtual columns which are functions of columns, for example LOWER(). For domain I have created a virtual column with the part before the first point. This is then subject to a unique constraint.
NB: Virtual columns are supported in Postgres 12 or higher. By these means we detect

that 'Joe BLOGGS' is a duplicate of Joe Bloggs
that hello.UK duplicate hello.com
that invalidis not an e-mail address.
Checking that an e-mail adresses is valid is complicated. A simple check that there is an @ followed by a point avoids telephone numbers etc.
You will have to determine which CONSTRAINTS are enforcable and which may stop you entering data which should be acceptable.

create table person (
 id                             serial     not null       ,
 full_name                        varchar(25)                 ,
 role                             varchar(25)                 ,
 first_name                       varchar(25)                 ,
 last_name                        varchar(25)                 ,
 linkedin_slug                    varchar(25)                 ,
 email                            varchar(25)                 ,
 domain                           varchar(25)                 ,
 created_at                        timestamp  default  now()  ,
 updated_at                        timestamp  default  now()  ,
 domainRoot varchar(25) GENERATED ALWAYS AS ( LEFT(domain, STRPOS(domain,'.')-1)) STORED,
 l_fname varchar(25) GENERATED ALWAYS AS ( LOWER(full_name)) STORED,
 CONSTRAINT "pk_person" PRIMARY KEY (id),
 CONSTRAINT  "uq_person_full_name_domain" UNIQUE (full_name, domain),
 CONSTRAINT  "uq_lower_full_name" UNIQUE (l_fname),
 CONSTRAINT "uq_email" UNIQUE (email),
 CONSTRAINT "ck_valid_email" CHECK (email LIKE '%@%.%'),
 CONSTRAINT "uq_domain_root" UNIQUE (domainRoot)
);

insert into person (full_name) values ('Joe Bloggs'),('Mrs Brown')

insert into person (full_name) values ('Joe BLOGGS')

ERROR:  duplicate key value violates unique constraint "uq_lower_full_name"

DETAIL: Key (l_fname)=(joe bloggs) already exists.

update person set email = 'invalid', updated_at = now()

ERROR:  new row for relation "person" violates check constraint "ck_valid_email"

DETAIL: Failing row contains (1, Joe Bloggs, null, null, null, null, invalid, null, 2022-04-08 16:22:03.749316, 2022-04-08 16:22:03.751928, null, joe bloggs).

update person set email = '[email protected]' where id = 2;

update person set domain = 'hello.com' , updated_at = now() where id = 1;

update person set domain = 'hello.UK' where id = 2

ERROR:  duplicate key value violates unique constraint "uq_domain_root"

DETAIL: Key (domainroot)=(hello) already exists.

SELECT * FROM person

id | full_name  | role | first_name | last_name | linkedin_slug | email      | domain    | created_at                 | updated_at                 | domainroot | l_fname   
-: | :--------- | :--- | :--------- | :-------- | :------------ | :--------- | :-------- | :------------------------- | :------------------------- | :--------- | :---------
 2 | Mrs Brown  | null | null       | null      | null          | [email protected] | null      | 2022-04-08 16:22:03.749316 | 2022-04-08 16:22:03.749316 | null       | mrs brown 
 1 | Joe Bloggs | null | null       | null      | null          | null       | hello.com | 2022-04-08 16:22:03.749316 | 2022-04-08 16:22:03.753807 | hello      | joe bloggs

db<>fiddle here

CodePudding user response：

Update: I ended up doing it by first doing upserts enforcing full_name, domain is unique, then deduplicating on linkedin_slug running a script that basically groups by linkedin_slug and gets whatever values are not null:

    SELECT
        max(id) id
        , max(full_name) full_name
        , max("role") "role"
        , max(first_name) first_name
        , max(last_name) last_name
        , linkedin_slug
        , max(linkedin_id) linkedin_id
        , max(email) email
        , max("domain") "domain"
        , max(yc_bio) yc_bio
        , min(created_at) created_at
        , now() updated_at
        , max(extrapolated_email_confidence) extrapolated_email_confidence
        , max(email_status) email_status
        , max(email_searched_on_apollo::text)::bool email_searched_on_apollo
    FROM person
    GROUP BY linkedin_slug
    HAVING count(*) > 1

And then updating the original table with data from this subquery.

The full gist is here