Find matching persons from one table with another (MS SQL Server)-CodePudding

I have two tables:

table "Person"

ID          FirstName  LastName
----------- ---------- ----------
1           Janez      Novak
2           Matija     Špacapan
3           Francka    Joras

Table "UserList"

ID    FullName
----- --------------------
1     Andrej Novak
2     Novak Peter Janez
3     Jana Novak
4     Andrej Kosir
5     Jan Balon
6     Francka Joras
7     France Joras

As a result, the query must return those IDs from both tables, that FirstName and Lastname from table Person exist in table UserList. The name and Lastname must be precisely the same. FullName in table UserList can include the middle name - which should be "ignored".

Match: Janez Novak = Janez Novak OR Novak Janez OR Janez Peter Novak

Not a match: Janez Novak <> Janeza Novak OR Jjanez Novak

Wanted results:

ID   FirstName  LastName  ID   WholeName
---- ---------- --------- ---- -------------------
1    Janez      Novak     2    Novak Peter Janez
3    Francka    Joras     6    Francka Joras

This is my query:

SELECT 
    A.ID
    ,A.FirstName
    ,A.LastName
    ,B.ID
    ,B.WholeName
FROM    
    dbo.UserList B
    cross join dbo.Person A 
WHERE   
    (                                                
    CHARINDEX('"' A.FirstName '"', '"' Replace(B.WholeName,' ','"') '"') > 0
     AND CHARINDEX('"' A.LastName '"', '"' Replace(B.WholeName,' ','"') '"') > 0 
    )

The query works OK when there are not many records in the tables.

But my tables have: "Person" -> 400k and "UserList" -> 14k records.

Is my approach to finding a solution OK, or is there any other more efficient way to do that? Thank you.

CodePudding user response：

Your schema is broken :p

There are various heuristis for doing the matching, but I expect you'll be able to find counterexamples to break whatever you try. For example what about the four people: Peter Smith, Pete Smith, Peter Smithson, and Pete Smithson?

Here's a %LIKE% approach, which I'd expect to be slow.

SELECT p.ID, p.FirstName, p.LastName, u.ID, u.FullName,
    CASE WHEN COUNT(*) OVER (PARTITION BY p.ID) > 1 THEN 0 ELSE 1 END AS MatchIsUnique
FROM Person p
    INNER JOIN UserList u
        ON u.FullName LIKE p.FirstName   '%'
        AND u.LastName LIKE '%'   p.LastName

Here's a string manipulation approach based on the assumption that the space character is the delimiter.

SELECT p.ID, p.FirstName, p.LastName, u.ID, u.FullName,
    CASE WHEN COUNT(*) OVER (PARTITION BY p.ID) > 1 THEN 0 ELSE 1 END AS MatchIsUnique
FROM Person p
    INNER JOIN UserList u
        ON p.FirstName = SUBSTRING(@FullName, 0, CHARINDEX(' ', @Fullname))
        AND p.LastName = SUBSTRING(@FullName, LEN(@FullName) - CHARINDEX(' ', REVERSE(@Fullname)) 1, CHARINDEX(' ', REVERSE(@Fullname)))

Probably also quite slow. Maybe you could speed it up by adding

SUBSTRING(@FullName, 0, CHARINDEX(' ', @Fullname)) and
SUBSTRING(@FullName, LEN(@FullName) - CHARINDEX(' ', REVERSE(@Fullname)) 1, CHARINDEX(' ', REVERSE(@Fullname)))

as computed columns and indexing them.

CodePudding user response：

Create tables

create table persons (
  id int IDENTITY(1,1) PRIMARY KEY,
  FirstName nvarchar(32) NOT NULL,
  LastName nvarchar(32) NOT NULL
);

create table users (
  id int IDENTITY(1,1) PRIMARY KEY,
  FullName nvarchar(32) NOT NULL
);

Sample data

INSERT INTO persons (FirstName, LastName)
values
('Janez','Novak'),
('Matija','Špacapan'),
('Francka','Joras');

INSERT INTO users (FullName)
VALUES
('Andrej Novak'),
('Novak Peter Janez'),
('Jana Novak'),
('Andrej Kosir'),
('Jan Balon'),
('Francka Joras'),
('France Joras');

Query (matching names)

SELECT p.id, p.FirstName, p.LastName, u.id as user_id, u.FullName
FROM persons p, users u
WHERE
  u.name LIKE CONCAT(p.fname, '%', p.lname)
  OR
  u.name LIKE CONCAT(p.lname, '%', p.fname)

Output

Running example SQL Fiddle

Above example link is of MySQL & the code is working fine on SQL server