I have a table where I save contacts data
Table "public.person"
Column | Type | Collation | Nullable | Default
------------------------------- -------------------------- ----------- ---------- ------------------------------------
id | integer | | not null | nextval('person_id_seq'::regclass)
full_name | character varying | | |
role | character varying | | |
first_name | character varying | | |
last_name | character varying | | |
linkedin_slug | character varying | | |
email | character varying | | |
domain | character varying | | |
created_at | timestamp with time zone | | | now()
updated_at | timestamp with time zone | | | now()
Indexes:
"pk_person" PRIMARY KEY, btree (id)
"ix_person_domain" btree (domain)
"ix_person_email" btree (email)
"ix_person_updated_at" btree (updated_at)
"uq_person_full_name_domain" UNIQUE CONSTRAINT, btree (full_name, domain)
I add data to this table from several sources. Some sources have Linkedin profiles data about people, other sources have email data. Sometimes the full names are not equal, even if they refer to the same person.
And I want to do upserts to not have duplicated data. For now I'm using the constraint on full_name, domain
. I know it's an oversimplification as there may be 2 different people with the same full name in the same company, but that's not a problem at this moment.
The problem comes when a person has different full names in the different data sources I use, but the same Linkedin profile, so I know it's the same person.
Or when they're associated to 2 domains from the same company.
In those cases, I end up with duplicated rows for some people. For example:
full_name | domain | linkedin_slug |
---|---|---|
Raffi SARKISSIAN | getlago.com | sarkissianraffi |
Raffi Sarkissian | getlago.com | sarkissianraffi |
That one is a trivial one that could be solved making the constraint on lower(full_name), domain
, but there are cases where the last name is not the same (people have more than 1 last name in many countries and they may not use them all sometimes).
Another example
full_name | domain | linkedin_slug |
---|---|---|
Amir Manji | tenjin.com | amirmanji |
Amir Manji | tenjin.io | amirmanji |
Ideally I'd like to be able to enforce more than 1 constraint at the same time in Postgres, but I've seen it's not easy or out-of-the-box. I don't/can't create a unique constraint on (full_name, domain, linkedin_slug)
. And the solution from the accepted answer is not so good for my use case because I have way more cols than in that example and I'd have to write a different upsert function for each data source (not all of them have the same attributes)
What I'm thinking is making a script to deduplicate the info 'manually' after inserting new data, but I'm not sure if there are better ways to address this.
How would you go about it?
CodePudding user response:
We cannot create unique constraints on functions of columns but we can create unique constraints on virtual columns which are functions of columns, for example LOWER()
. For domain I have created a virtual column with the part before the first point. This is then subject to a unique constraint.
NB: Virtual columns are supported in Postgres 12 or higher.
By these means we detect
- that 'Joe BLOGGS' is a duplicate of
Joe Bloggs
- that
hello.UK
duplicatehello.com
- that
invalid
is not an e-mail address.
Checking that an e-mail adresses is valid is complicated. A simple check that there is an@
followed by a point avoids telephone numbers etc.
You will have to determine which CONSTRAINTS are enforcable and which may stop you entering data which should be acceptable.
create table person ( id serial not null , full_name varchar(25) , role varchar(25) , first_name varchar(25) , last_name varchar(25) , linkedin_slug varchar(25) , email varchar(25) , domain varchar(25) , created_at timestamp default now() , updated_at timestamp default now() , domainRoot varchar(25) GENERATED ALWAYS AS ( LEFT(domain, STRPOS(domain,'.')-1)) STORED, l_fname varchar(25) GENERATED ALWAYS AS ( LOWER(full_name)) STORED, CONSTRAINT "pk_person" PRIMARY KEY (id), CONSTRAINT "uq_person_full_name_domain" UNIQUE (full_name, domain), CONSTRAINT "uq_lower_full_name" UNIQUE (l_fname), CONSTRAINT "uq_email" UNIQUE (email), CONSTRAINT "ck_valid_email" CHECK (email LIKE '%@%.%'), CONSTRAINT "uq_domain_root" UNIQUE (domainRoot) );
insert into person (full_name) values ('Joe Bloggs'),('Mrs Brown')
insert into person (full_name) values ('Joe BLOGGS')
ERROR: duplicate key value violates unique constraint "uq_lower_full_name"
DETAIL: Key (l_fname)=(joe bloggs) already exists.
update person set email = 'invalid', updated_at = now()
ERROR: new row for relation "person" violates check constraint "ck_valid_email"
DETAIL: Failing row contains (1, Joe Bloggs, null, null, null, null, invalid, null, 2022-04-08 16:22:03.749316, 2022-04-08 16:22:03.751928, null, joe bloggs).
update person set email = '[email protected]' where id = 2;
update person set domain = 'hello.com' , updated_at = now() where id = 1;
update person set domain = 'hello.UK' where id = 2
ERROR: duplicate key value violates unique constraint "uq_domain_root"
DETAIL: Key (domainroot)=(hello) already exists.
SELECT * FROM person
id | full_name | role | first_name | last_name | linkedin_slug | email | domain | created_at | updated_at | domainroot | l_fname -: | :--------- | :--- | :--------- | :-------- | :------------ | :--------- | :-------- | :------------------------- | :------------------------- | :--------- | :--------- 2 | Mrs Brown | null | null | null | null | [email protected] | null | 2022-04-08 16:22:03.749316 | 2022-04-08 16:22:03.749316 | null | mrs brown 1 | Joe Bloggs | null | null | null | null | null | hello.com | 2022-04-08 16:22:03.749316 | 2022-04-08 16:22:03.753807 | hello | joe bloggs
db<>fiddle here
CodePudding user response:
Update: I ended up doing it by first doing upserts enforcing full_name, domain
is unique, then deduplicating on linkedin_slug
running a script that basically groups by linkedin_slug
and gets whatever values are not null:
SELECT
max(id) id
, max(full_name) full_name
, max("role") "role"
, max(first_name) first_name
, max(last_name) last_name
, linkedin_slug
, max(linkedin_id) linkedin_id
, max(email) email
, max("domain") "domain"
, max(yc_bio) yc_bio
, min(created_at) created_at
, now() updated_at
, max(extrapolated_email_confidence) extrapolated_email_confidence
, max(email_status) email_status
, max(email_searched_on_apollo::text)::bool email_searched_on_apollo
FROM person
GROUP BY linkedin_slug
HAVING count(*) > 1
And then updating the original table with data from this subquery.
The full gist is here