Delete rows having exactly same values-CodePudding

Please consider following table

Table Name: mytable

model_id	event_name	time_of_event
9	CREATE	2016-01-01 00:00:00
9	UPDATE	2016-01-01 01:00:00
9	DELETE	2016-01-01 02:00:00
3	CREATE	2016-01-01 03:00:00 `DUPLICATE`
3	CREATE	2016-01-01 03:00:00 `DUPLICATE` delete this
3	DELETE	2016-01-01 04:00:00

How to delete 5th entry from above table i.e. delete row with exactly same value from table. In above example no column is unique.

Please keep in mind that database could be huge and I don't want to recreate or republish data with distinct values into the table.

// Use below code to create above example 

CREATE TABLE mytable(
  model_id  integer,
  event_name varchar(7),
  time_of_event timestamp
);

INSERT INTO mytable 
    (model_id, event_name, time_of_event) 
    VALUES
    (9, 'CREATE', '2016-01-01 00:00:00'),
    (9, 'UPDATE', '2016-01-01 01:00:00'),
    (9, 'DELETE', '2016-01-01 02:00:00'), 
    (3, 'CREATE', '2016-01-01 03:00:00'),
    (3, 'CREATE', '2016-01-01 03:00:00'),
    (3, 'DELETE', '2016-01-01 04:00:00');
    
SELECT * FROM mytable;

CodePudding user response：

Having two or more rows with identical values is a sign of very bad design. I suppose model_id is the table's primary key. I wonder how did you end up in this situation.

I don't want to recreate or republish data with distinct values into the table.

One possible solution is to add (not recreate/republish) a column with unique values to your table, then delete the duplicate rows you want.

ALTER TABLE mytable ADD COLUMN MyTableID INT;

You need to fill this column with unique values:

SET @i := 0;
UPDATE MyTable SET MyTableID = @i:=(@i 1) WHERE 1=1;

Next, you can write the following query:

SELECT
  MT.MyTableID, MT.model_id, MT.event_name, MT.time_of_event
FROM
  MyTable MT,
  (SELECT model_id, event_name, time_of_event, COUNT(*)
   FROM MyTable
   GROUP BY model_id, event_name, time_of_event
   HAVING COUNT(*) > 1
  ) TmpTable
WHERE 
  MT.model_id = TmpTable.model_id
  AND MT.event_name = TmpTable.event_name
  AND MT.time_of_event = TmpTable.time_of_event
;

Result:

MyTableID	model_id	event_name	time_of_event
4	3	CREATE	2016-01-01 03:00:00
5	3	CREATE	2016-01-01 03:00:00

You can now proceed with the deletion of duplicate rows:

DELETE FROM MyTable WHERE MyTableID IN (5 /*, the ones you wish */);

If you have too many duplicate values and you can't afford to delete them manually, you can do it like this:

DELETE FROM MyTable WHERE MyTableID IN (/* The previous long query */) LIMIT /* The number of rows of the previous long query - 1 */;

The -1 is to preserve one row of the duplicates. If you want to delete them all, remove the LIMIT clause.

CodePudding user response：

Try with a helper table that contains the duplicates, but only one each:

With this scenario: ...

CREATE TABLE mytable(
model_id integer,event_name VARCHAR(8),time_of_event TIMESTAMP)
; 

INSERT INTO mytable
-- your input data ...
          SELECT 9,'CREATE',TIMESTAMP '2016-01-01 00:00:00'
UNION ALL SELECT 9,'UPDATE',TIMESTAMP '2016-01-01 01:00:00'
UNION ALL SELECT 9,'DELETE',TIMESTAMP '2016-01-01 02:00:00'
UNION ALL SELECT 3,'CREATE',TIMESTAMP '2016-01-01 03:00:00'
UNION ALL SELECT 3,'CREATE',TIMESTAMP '2016-01-01 03:00:00'
UNION ALL SELECT 3,'DELETE',TIMESTAMP '2016-01-01 04:00:00'
;

Create your helper table like so:

CREATE TABLE helper AS
SELECT
  model_id
, event_name
, time_of_event
FROM mytable
GROUP BY 
  model_id
, event_name
, time_of_event
HAVING COUNT(*) > 1;

Then, use the helper table to delete ... you will delete all rows, not only one of the duplicates ...

DELETE FROM mytable
WHERE(model_id,event_name,time_of_event) IN (
SELECT model_id,event_name,time_of_event FROM helper
);

And finally, insert all the rows from the helper table back in again:

INSERT INTO mytable
SELECT * FROM helper;                                                                                                                                                                              
COMMIT; -- if your connection is not auto-commit ...

But I'd like to add that, for most database systems, the other approach - to create a new table containing SELECT DISTINCT * FROM old_table is the faster alternative as soon as we are talking about around 20 to 25 % of the total row count.