Check table for similar text records and remove duplicate records (SQL Oracle database. )-CodePudding

I need to check an oracle database table for similar records and get the number of characters that differ. For this I use this query:

CREATE TABLE match_tab (
  id   NUMBER,
  col1 VARCHAR2(15),
  CONSTRAINT match_tab_pk PRIMARY  KEY (id)
);

INSERT INTO match_tab VALUES (1, 'Peter Parker');
INSERT INTO match_tab VALUES (2, 'peter parker');
INSERT INTO match_tab VALUES (3, 'Clark Kent' );
INSERT INTO match_tab VALUES (4, 'Wonder Woman');
INSERT INTO match_tab VALUES (5, 'Superman');
INSERT INTO match_tab VALUES (6, 'The Hulk');
INSERT INTO match_tab VALUES (7, 'Ponder Woman');
INSERT INTO match_tab VALUES (8, 'Claire Kent' );
INSERT INTO match_tab VALUES (9, 'Superman');
INSERT INTO match_tab VALUES (10, 'Iron Man');
COMMIT;

select  m1.col1 as mc1, 
        m2.col1 as mc2,
        UTL_MATCH.edit_distance(m1.col1, m2.col1) as match
from match_tab m1
join match_tab m2 on (1 = 1)
where UTL_MATCH.edit_distance(m1.col1, m2.col1) = 2 and m1.id != m2.id;

which gives the result:

   MC1             MC2                  MATCH
--------------- --------------- ----------
Peter Parker    peter parker             2
peter parker    Peter Parker             2
Clark Kent      Claire Kent              2
Claire Kent     Clark Kent               2

How can I remove duplicate records and get result like this:

MC1             MC2                  MATCH
--------------- --------------- ----------
Peter Parker    peter parker             2
Clark Kent      Claire Kent              2

Is it possible to optimize this query? For a large amount of data, it takes a long time.

I would be grateful for any help!

CodePudding user response：

Removing "duplicates" is simple (use e.g. < instead of !=):

SQL> SELECT m1.col1 AS mc1,
  2         m2.col1 AS mc2,
  3         UTL_MATCH.edit_distance (m1.col1, m2.col1) AS match
  4    FROM match_tab m1
  5         JOIN match_tab m2
  6            ON     UTL_MATCH.edit_distance (m1.col1, m2.col1) = 2
  7               AND m1.id < m2.id;

MC1             MC2                  MATCH
--------------- --------------- ----------
Peter Parker    peter parker             2
Clark Kent      Claire Kent              2

SQL>

How to make it faster? I don't know if you can make it any faster; you're comparing all rows, and - for large tables - it takes time.

I removed your cross-join (with on 1 = 1) as you actually can join m1 and m2 on utl_match.edit_distance, but I don't expect it to be a significant improvement to your code.

CodePudding user response：

You can:

use a hierarchical query so that you do not need to read the table twice and do not need to self-join;
add filters to check that the length of the columns is similar (which may be a simpler comparison than using UTL_MATCH and the optimiser may be able to exclude a lot of comparisons if it uses this filter first); and
filter on the PRIOR id < id rather than using != to compare the ids and eliminate the duplicates:

SELECT PRIOR col1 as mc1, 
       col1 as mc2,
       UTL_MATCH.edit_distance(PRIOR col1, col1) as match
FROM   match_tab
WHERE  LEVEL = 2
CONNECT BY
       LENGTH(PRIOR col1) - LENGTH(col1) BETWEEN -1 AND  1
AND    PRIOR id < id
AND    UTL_MATCH.edit_distance(PRIOR col1, col1) BETWEEN 1 AND 2;

Which, for your sample data, outputs:

MC1 MC2 MATCH

Peter Parker peter parker 2

Clark Kent Claire Kent 2

Wonder Woman Ponder Woman 1

MC1	MC2	MATCH
Peter Parker	peter parker	2
Clark Kent	Claire Kent	2
Wonder Woman	Ponder Woman	1

Note: if you only want matches where the edit-distance is 2 then change the last line to AND UTL_MATCH.edit_distance(PRIOR col1, col1) = 2;

db<>fiddle here