Home > Mobile >  How to efficiently find duplicate database entries (HSQL)
How to efficiently find duplicate database entries (HSQL)

Time:06-30

I have a large table of names. Each entry has a unique ID, a FORENAME, and a SURNAME. If different IDs have the same forename and surname that is not necessarily an error, but it often is, so I want a query that lists the suspects. Because the table is large I don't want to do this as a join, or rather: I don't want the query to be quadratic in table size. If I were coding this in a functional programming language, and my table was just a list of triples, I would do this as follows (in loglinear time):

  • sort by surname
  • group neighbouring elements with same surname
  • throw out the singleton groups [which can be done without using the length function]
  • within those groups (map), sort by forename, then group, then throw out singletons again
  • flatten those groups (twice) to get back to list of triples

How would I do this as an SQL query? The above may not be a good algo to run on DBs, but then what would be? I am not too fussed about the precise format of the output, as that output is just for human consumption. I should add that my database is in OpenOffice, so it is based on HSQL.

CodePudding user response:

You use GROUP BY and HAVING in SQL to do what you want. For example:

SELECT FIRSTNAME, LASTNAME FROM CUSTOMER GROUP BY LASTNAME, FIRSTNAME HAVING COUNT(*) > 1 ORDER BY LASTNAME

Note the ORDER BY clause is optional.

  • Related