mysql / sql fixing rows in many to many database-CodePudding

I have a many to many table that has multiple rows that are duplicates due to an error in import.

Schema:

CREATE TABLE `actor` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `name` text NOT NULL,
 `pic_id` int(11) DEFAULT NULL,
 `dob` varchar(4) DEFAULT NULL,
 PRIMARY KEY (`id`),
 KEY `pic_id` (`pic_id`)
) ENGINE=InnoDB AUTO_INCREMENT=3195 DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci;

CREATE TABLE `video_actor` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `video_id` int(11) NOT NULL,
 `actor_id` int(11) NOT NULL,
 PRIMARY KEY (`id`),
 KEY `video_id` (`video_id`),
 KEY `actor_id` (`actor_id`),
 CONSTRAINT `video_actor_ibfk_2` FOREIGN KEY (`actor_id`) REFERENCES `actor` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=23757 DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci;

Fiddle

Specifically I have an actor table with

table actor
actor.id (pk)
actor.name

and I have duplicate actor names

they are linked through the table video_actor to video as so:

table video_actor
id (pk)
actor_id (fk)
video_id (fk)

table video
id (pk)
...etc

I tried for a few hours to make a SQL query to update the linking table (video_actor) with the correct id but I failed. The closest I feel I got is:

UPDATE video_actor as x
join 
(
   SELECT min_id,max_id
      FROM (
   SELECT name, MIN(id) as min_id,MAX(id) as max_id
      FROM actor
   GROUP BY name HAVING COUNT(name) > 1
   ) AS y
) as q on x.video_id = q.min_id
SET x.actor_id = q.min_id
WHERE x.actor_id = q.max_id;

the above query does nothing so I wrote some php to do what I couldn't do in pure sql:

public function remove_dup_actors() {
    $sql="
        SELECT name,min_id,max_id
        FROM (
            SELECT name, MIN(id) as min_id,MAX(id) as max_id
            FROM actor
            GROUP BY name HAVING COUNT(name) > 1
        ) as dups;
    ";
    $actor_dups=self::$dB->get_rows($sql);
    foreach($actor_dups as $actor) {
        $sql="UPDATE video_actor SET actor_id=${actor['min_id']} WHERE actor_id=${actor['max_id']}";
        self::$dB->update_row($sql);
        echo "Updated ${actor['name']}\n";
        $sql="DELETE FROM actor WHERE id=${actor['max_id']}";
        self::$dB->update_row($sql);
    }
}

this works.

Please experts, can someone tell me what is wrong with my pure sql (failed) solution?

UPDATE: Tried:

UPDATE video_actor as x
join 
(
   SELECT id,min_id,max_id
      FROM (
         SELECT id, name, MIN(id) as min_id,MAX(id) as max_id
         FROM actor
         GROUP BY name HAVING COUNT(name) > 1
      ) AS y
   ) as q on x.video_id = y.id
SET x.actor_id = q.min_id
WHERE x.actor_id = q.max_id;

and get unknown column y.id on ON clause

CodePudding user response：

It's not exactly clear from your question what the "correct id" is, but I'm assuming you want the lowest id of all rows with the same actor.name. This should do it:

UPDATE actor a
    JOIN video_actor v ON v.actor_id = a.id
    JOIN (
      SELECT MIN(id) as min_id, a2.name
      FROM actor a2
      GROUP BY a2.name
    ) a2 ON a2.name = a.name
SET v.actor_id = a2.min_id;

First we join the two tables on id/actor_id then we can update video_actor based on the first id belonging to a given name.

You'll need to deal with eliminating dupes with some separate queries. Here is an example of how to do it on the actor table by joining rows to the same subquery we used in above on name but which do not match min_id:

DELETE a
FROM actor a
    JOIN (
      SELECT MIN(id) as min_id, a2.name
      FROM actor a2
      GROUP BY a2.name
    ) a2 ON a2.name = a.name AND a.id != a2.min_id;

Also, I think you should probably be using VARCHAR for datatype of actor.name with an index - that will make this much faster.