I need to check an oracle database table for similar records and get the number of characters that differ. For this I use this query:
CREATE TABLE match_tab (
id NUMBER,
col1 VARCHAR2(15),
CONSTRAINT match_tab_pk PRIMARY KEY (id)
);
INSERT INTO match_tab VALUES (1, 'Peter Parker');
INSERT INTO match_tab VALUES (2, 'peter parker');
INSERT INTO match_tab VALUES (3, 'Clark Kent' );
INSERT INTO match_tab VALUES (4, 'Wonder Woman');
INSERT INTO match_tab VALUES (5, 'Superman');
INSERT INTO match_tab VALUES (6, 'The Hulk');
INSERT INTO match_tab VALUES (7, 'Ponder Woman');
INSERT INTO match_tab VALUES (8, 'Claire Kent' );
INSERT INTO match_tab VALUES (9, 'Superman');
INSERT INTO match_tab VALUES (10, 'Iron Man');
COMMIT;
select m1.col1 as mc1,
m2.col1 as mc2,
UTL_MATCH.edit_distance(m1.col1, m2.col1) as match
from match_tab m1
join match_tab m2 on (1 = 1)
where UTL_MATCH.edit_distance(m1.col1, m2.col1) = 2 and m1.id != m2.id;
which gives the result:
MC1 MC2 MATCH
--------------- --------------- ----------
Peter Parker peter parker 2
peter parker Peter Parker 2
Clark Kent Claire Kent 2
Claire Kent Clark Kent 2
How can I remove duplicate records and get result like this:
MC1 MC2 MATCH --------------- --------------- ---------- Peter Parker peter parker 2 Clark Kent Claire Kent 2
Is it possible to optimize this query? For a large amount of data, it takes a long time.
I would be grateful for any help!
CodePudding user response:
Removing "duplicates" is simple (use e.g. <
instead of !=
):
SQL> SELECT m1.col1 AS mc1,
2 m2.col1 AS mc2,
3 UTL_MATCH.edit_distance (m1.col1, m2.col1) AS match
4 FROM match_tab m1
5 JOIN match_tab m2
6 ON UTL_MATCH.edit_distance (m1.col1, m2.col1) = 2
7 AND m1.id < m2.id;
MC1 MC2 MATCH
--------------- --------------- ----------
Peter Parker peter parker 2
Clark Kent Claire Kent 2
SQL>
How to make it faster? I don't know if you can make it any faster; you're comparing all rows, and - for large tables - it takes time.
I removed your cross-join (with on 1 = 1
) as you actually can join m1
and m2
on utl_match.edit_distance
, but I don't expect it to be a significant improvement to your code.
CodePudding user response:
You can:
- use a hierarchical query so that you do not need to read the table twice and do not need to self-join;
- add filters to check that the length of the columns is similar (which may be a simpler comparison than using
UTL_MATCH
and the optimiser may be able to exclude a lot of comparisons if it uses this filter first); and - filter on the
PRIOR id < id
rather than using!=
to compare theid
s and eliminate the duplicates:
SELECT PRIOR col1 as mc1,
col1 as mc2,
UTL_MATCH.edit_distance(PRIOR col1, col1) as match
FROM match_tab
WHERE LEVEL = 2
CONNECT BY
LENGTH(PRIOR col1) - LENGTH(col1) BETWEEN -1 AND 1
AND PRIOR id < id
AND UTL_MATCH.edit_distance(PRIOR col1, col1) BETWEEN 1 AND 2;
Which, for your sample data, outputs:
MC1 MC2 MATCH Peter Parker peter parker 2 Clark Kent Claire Kent 2 Wonder Woman Ponder Woman 1
Note: if you only want matches where the edit-distance is 2
then change the last line to AND UTL_MATCH.edit_distance(PRIOR col1, col1) = 2;
db<>fiddle here