Get IDs of MySQL Duplicate Rows to Delete


I'm trying to write a Laravel eloquent statement to do the following.

Query a table and get the IDs of all the duplicate rows (ideally all the IDs except the ID of the first instance in each duplicate set).

Right now I have the following MySQL statement:

select `codes`, count(`codes`) as `occurrences`, `customer_id` from `pizzas` 
group by `codes`, `customer_id` 
having `occurrences` > 1;

The duplicates are any rows that share the same combination of codes and customer_id, for example:

codes,customer_id
183665A4,3
183665A4,3
183665A4,3
183665A4,3
183665A4,3

I'm trying to delete all but 1 of those.

This returns each set of codes with its occurrence count and its customer_id, since a duplicate is a match on both columns.

Right now I loop through this, save the ID of the first instance, and then query again and delete any row without that ID. This is not very fast: there are about 50 million rows, so each query takes forever, and there are multiple queries per duplicate set.

// get every order that shares the same code and customer ID
$orders = Order::select('codes', DB::raw('count(`codes`) as `occurrences`'), 'customer_id')
    ->groupBy('codes')
    ->groupBy('customer_id')
    ->having('occurrences', '>', 1)
    ->limit(100)
    ->get();

// loop through those orders
foreach ($orders as $order)
{
    // find the first (lowest-id) order in this duplicate set
    $first_order = Order::where('codes', $order->codes)
        ->where('customer_id', $order->customer_id)
        ->orderBy('id')
        ->first();

    // delete all but the first
    Order::where('codes', $order->codes)
        ->where('customer_id', $order->customer_id)
        ->where('id', '!=', $first_order->id)
        ->delete();
}

There has got to be a more efficient way to track down all rows that share the same code and customer_id, and delete all the duplicates but keep the first instance, right? lol

I'm thinking maybe if I can add a fake column to the results that is an array of every ID, I could at least then remove the first ID and delete the others.
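One way to get that "array of IDs" per duplicate set is MySQL's GROUP_CONCAT (a sketch against the pizzas table from the question; trimming off the first ID and feeding the rest to a DELETE would still need application-side string handling):

```sql
-- one row per duplicate set, with all member ids collected in order
SELECT codes, customer_id,
       GROUP_CONCAT(id ORDER BY id) AS ids
FROM pizzas
GROUP BY codes, customer_id
HAVING COUNT(*) > 1;
```

Note that GROUP_CONCAT truncates its result at group_concat_max_len (1024 bytes by default), which matters with large duplicate sets.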

CodePudding user response:

Don't involve PHP

This seems not very fast

The logic in the question is inherently slow because it's lots of queries and for each query there's:

  • DB<->PHP network roundtrip
  • PHP ORM logic/overhead

Given the numbers in the question (roughly 2 million duplicate rows), suppose for argument's sake that works out to 100,000 sets of duplicates. With the loop above fetching 100 sets per outer query, overall that's:

  • 1,000 queries finding duplicates
  • 100,000 queries finding the first record
  • 100,000 delete queries

201,000 queries is a lot, and the PHP overhead makes it an order of magnitude slower (a guess, based on experience).

Do it directly on the DB

Just eliminating the PHP/ORM/network time (even if everything is on the same machine) would make the process markedly faster; that would involve writing a stored procedure to mimic the PHP logic in the question.

But there's a simpler way, the specifics depend on the circumstances. In comments you've said:

  • The table is 140GB in size
  • It contains 50 million rows
  • Approx 2 million are duplicate records
  • There isn't enough free space to make a copy of the table

Taking these comments at face value the process I suggest is:

Ensure you have a functional DB backup

Before doing anything make sure you have a functional DB backup. If you manage to make a mistake and e.g. drop the table - be sure you can recover without loss of data.

You'll be testing this process on a copy of the database first anyway, right :) ?

Create a table of "ids to keep" and populate it

This is a permutation of removing duplicates with a unique index:

CREATE TABLE ids_to_keep (
    id INT PRIMARY KEY,
    codes VARCHAR(50) NOT NULL, # use same schema as source table
    customer_id INT NOT NULL, # use same schema as source table
    UNIQUE KEY derp (codes,customer_id)
);

INSERT IGNORE INTO ids_to_keep 
    SELECT id, codes, customer_id from pizzas;

MySQL will silently drop the rows that conflict with the unique index, resulting in a table with exactly one id per (codes, customer_id) tuple.
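With the sample rows from the question (the ids here are hypothetical), the effect looks like this:

```sql
-- pizzas:                        ids_to_keep after INSERT IGNORE:
-- id | codes    | customer_id    id | codes    | customer_id
-- 1  | 183665A4 | 3              1  | 183665A4 | 3
-- 2  | 183665A4 | 3
-- ...
-- 5  | 183665A4 | 3
SELECT * FROM ids_to_keep WHERE codes = '183665A4' AND customer_id = 3;
-- one row: the first row the SELECT produced for that (codes, customer_id) pair
```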

If you don't have space for this table - make room :). It shouldn't be too large: 140GB across 50M rows means each row is roughly 3KB, and this temporary table holds only three small columns, so it will likely need a single-digit percentage of the original size.

Delete the duplicate records

Before executing any expected-to-be-slow query use EXPLAIN to check if the query will complete in a reasonable amount of time.
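For example, prefixing the single-query variant below with EXPLAIN shows the plan without touching any rows (the comment describes what to look for, not guaranteed output):

```sql
EXPLAIN DELETE FROM pizzas
WHERE id NOT IN (SELECT id FROM ids_to_keep);
-- check how the subquery is handled: ids_to_keep's primary key on id
-- should let MySQL test membership via the index rather than a scan
```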

To run as a single query:

DELETE FROM 
  pizzas 
WHERE 
  id NOT IN (SELECT id from ids_to_keep);

If you wish to do things in chunks:

DELETE FROM 
  pizzas 
WHERE 
  id BETWEEN 0 AND 10000 AND
  id NOT IN (SELECT id from ids_to_keep);
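After the delete finishes, a quick sanity check (the expected result follows from how ids_to_keep was constructed):

```sql
-- every surviving row should now have its id in ids_to_keep
SELECT COUNT(*) FROM pizzas
WHERE id NOT IN (SELECT id FROM ids_to_keep);
-- should return 0
```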

Cleanup

Once the table isn't needed any more, get rid of it:

DROP TABLE ids_to_keep;

Make sure this doesn't happen again

To prevent this happening again, add a unique index to the table:

CREATE UNIQUE INDEX codes_customer ON pizzas (codes, customer_id);

(MySQL requires a name for the index; codes_customer here is arbitrary.)
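With the unique index in place, the application can lean on it instead of checking for duplicates first, e.g. (a sketch; adjust the column list to the real schema):

```sql
-- silently skip inserts that would duplicate an existing
-- (codes, customer_id) pair, same mechanism as used above
INSERT IGNORE INTO pizzas (codes, customer_id) VALUES ('183665A4', 3);
```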

CodePudding user response:

Try this one; it keeps only the latest (highest) id in each duplicate set and deletes the rest via a self-join (note the join must match on both codes and customer_id, per the question's definition of a duplicate):

$deleteDuplicates = DB::table('orders as ord1')
    ->join('orders as ord2', function ($join) {
        $join->on('ord1.codes', '=', 'ord2.codes')
             ->on('ord1.customer_id', '=', 'ord2.customer_id')
             ->on('ord1.id', '<', 'ord2.id');
    })
    ->delete();