Home > Back-end >  Find duplicates in table without primary key
Find duplicates in table without primary key

Time:11-18

I have a table with three columns. The first column, defined as a varchar type, is the client's number (c_number), the other two are also of type varchar - the client's first and last name (c_first_name and c_last_name).

The table does not have a separate primary key. But there are duplicates with the same c_number column (for example, for the same clients, but first and last names are written in different registers).

The number of rows in the table is quite large - about a million.

It is necessary to display all clients whose number is duplicated more than N times.

Here is the code that solves the problem head-on:

SELECT *
    COUNT(c_number) AS c
GROUP BY 
    c_number
HAVING 
    (c > N) ;

But this approach is seems to be very slow. My solution is to add index:

CREATE INDEX idx_c_number 
    ON TABLENAME(c_number);

After that, if I understand correctly, it will be possible to execute expressions to search for duplicates in real time.

If this is still not the case, please tell me the best solution to find duplicates with the best performance (bearing in mind the fact that we still do not have a primary key).

CodePudding user response:

Actually, "a million records" isn't considered "large" anymore ...

Yes, an index would enable you to use a query such as the one you described, with reasonable improvement in speed. "But, at a cost ..."

While some people consider it "necessary" or at least "customary" to have a primary key on a table, it's really up to you. Some tables, for example, are basically just "event or transaction logs," and you are never going to use a "primary key" to refer to them. If you will never use one, you don't need one. It will not impact "indexing" either way.

Now, by creating this index, you will introduce a long-term cost: the index must henceforth be maintained. This will introduce an additional cost to anything and everything which inserts or deletes a record, or which updates the indexed value. So, while this will of course make the present query "faster," it isn't free. "Plan accordingly." Only you can decide what's best for you. Because of the larger implications, I encourage you to discuss the matter with your colleagues before proceeding.

  • Related