I have a many to many table that has multiple rows that are duplicates due to an error in import.
Schema:
CREATE TABLE `actor` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` text NOT NULL,
`pic_id` int(11) DEFAULT NULL,
`dob` varchar(4) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `pic_id` (`pic_id`)
) ENGINE=InnoDB AUTO_INCREMENT=3195 DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci;
CREATE TABLE `video_actor` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`video_id` int(11) NOT NULL,
`actor_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `video_id` (`video_id`),
KEY `actor_id` (`actor_id`),
CONSTRAINT `video_actor_ibfk_2` FOREIGN KEY (`actor_id`) REFERENCES `actor` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=23757 DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci;
Specifically I have an actor table with
table actor
actor.id (pk)
actor.name
and I have duplicate actor names
they are linked through the table video_actor to video as so:
table video_actor
id (pk)
actor_id (fk)
video_id (fk)
table video
id (pk)
...etc
I tried for a few hours to make a SQL query to update the linking table (video_actor) with the correct id but I failed. The closest I feel I got is:
UPDATE video_actor as x
join
(
SELECT min_id,max_id
FROM (
SELECT name, MIN(id) as min_id,MAX(id) as max_id
FROM actor
GROUP BY name HAVING COUNT(name) > 1
) AS y
) as q on x.video_id = q.min_id
SET x.actor_id = q.min_id
WHERE x.actor_id = q.max_id;
the above query does nothing so I wrote some php to do what I couldn't do in pure sql:
public function remove_dup_actors() {
$sql="
SELECT name,min_id,max_id
FROM (
SELECT name, MIN(id) as min_id,MAX(id) as max_id
FROM actor
GROUP BY name HAVING COUNT(name) > 1
) as dups;
";
$actor_dups=self::$dB->get_rows($sql);
foreach($actor_dups as $actor) {
$sql="UPDATE video_actor SET actor_id=${actor['min_id']} WHERE actor_id=${actor['max_id']}";
self::$dB->update_row($sql);
echo "Updated ${actor['name']}\n";
$sql="DELETE FROM actor WHERE id=${actor['max_id']}";
self::$dB->update_row($sql);
}
}
this works.
Please experts, can someone tell me what is wrong with my pure sql (failed) solution?
UPDATE: Tried:
UPDATE video_actor as x
join
(
SELECT id,min_id,max_id
FROM (
SELECT id, name, MIN(id) as min_id,MAX(id) as max_id
FROM actor
GROUP BY name HAVING COUNT(name) > 1
) AS y
) as q on x.video_id = y.id
SET x.actor_id = q.min_id
WHERE x.actor_id = q.max_id;
and get unknown column y.id on ON clause
CodePudding user response:
It's not exactly clear from your question what the "correct id" is, but I'm assuming you want the lowest id of all rows with the same actor.name
. This should do it:
UPDATE actor a
JOIN video_actor v ON v.actor_id = a.id
JOIN (
SELECT MIN(id) as min_id, a2.name
FROM actor a2
GROUP BY a2.name
) a2 ON a2.name = a.name
SET v.actor_id = a2.min_id;
First we join the two tables on id/actor_id
then we can update video_actor
based on the first id belonging to a given name
.
You'll need to deal with eliminating dupes with some separate queries. Here is an example of how to do it on the actor
table by joining rows to the same subquery we used in above on name
but which do not match min_id
:
DELETE a
FROM actor a
JOIN (
SELECT MIN(id) as min_id, a2.name
FROM actor a2
GROUP BY a2.name
) a2 ON a2.name = a.name AND a.id != a2.min_id;
Also, I think you should probably be using VARCHAR
for datatype of actor.name
with an index - that will make this much faster.