I am using DSpace version 6.3 here. I discovered that I have created duplicate entries when performing a batch import. Using the SQL query from this answer, I managed to list all the duplicates in a given field. For this example, I am using the dc.subject
field (metadata_field_id=57
) to list items (dspace_object_id
) that have duplicate values in dc.subject
field.
Below is the query that I used:
SELECT metadata_value_id,
dspace_object_id,
text_value
FROM (SELECT *,
COUNT(*) OVER (PARTITION BY dspace_object_id, text_value) AS cnt
FROM metadatavalue where metadata_field_id=57) e
WHERE cnt > 1
Below is the sample list generated from that query:
metadata_value_id | dspace_object_id | text_value |
---|---|---|
503018 | 13f07109-7797-4d5b-a8bd-1f9e91a2433d | pompanos |
503021 | 13f07109-7797-4d5b-a8bd-1f9e91a2433d | pompanos |
503217 | 233d1e67-b698-4e90-8e70-a175776b2d80 | pests |
503219 | 233d1e67-b698-4e90-8e70-a175776b2d80 | pests |
83574 | 47753988-fc2a-4416-b20d-acbff6e256de | Penaeus monodon |
10800 | 47753988-fc2a-4416-b20d-acbff6e256de | Penaeus monodon |
531520 | 50965923-bc65-4fdf-af61-2c8debdfe057 | Penaeus monodon |
531521 | 50965923-bc65-4fdf-af61-2c8debdfe057 | Penaeus monodon |
483882 | 57d0544c-1825-431a-acf9-eb835c24920b | development |
483879 | 57d0544c-1825-431a-acf9-eb835c24920b | development |
Based on the table above, you can see that there are 2 occurrences (others have more than 2) of a text_value
in the same item (dspace_object_id
).
My question is how can I delete the duplicate but retain the first occurrence?
In the table above, rows with the following metadata_value_id
should be deleted:
503021
503219
83574 <-- This should be deleted instead of 10800 because 10800 was put in (inserted) first.
531521
483882 <-- Should be deleted instead of 483879 because of the same reason above.
I'm hoping to do this via SQL query alone because what I have done before is import this list to a Google spreadsheet, remove the duplicates there using a Remove Duplicates add-on, download it as a CSV, and then use that CSV to update the metadatavalue
table.
CodePudding user response:
Assuming that the 'first occurrence' means the one with smaller metadata_value_id:
delete from metadatavalue as Not1stOcc
where exists (select *
from metadatavalue as FirstOcc
where FirstOcc.text_value = Not1stOcc.text_value
and FirstOcc.metadata_value_id < Not1stOcc.metadata_value_id
)
should work
CodePudding user response:
For my future reference, I modified tinazmu's answer that worked for my use case.
DELETE FROM metadatavalue as Not1stOcc
WHERE EXISTS (SELECT metadata_value_id, dspace_object_id, text_value, metadata_field_id
FROM metadatavalue AS FirstOcc
WHERE FirstOcc.text_value = Not1stOcc.text_value
AND FirstOcc.metadata_field_id = Not1stOcc.metadata_field_id
AND FirstOcc.dspace_object_id = Not1stOcc.dspace_object_id
AND FirstOcc.metadata_field_id = Not1stOcc.metadata_field_id
AND metadata_field_id = 57
AND FirstOcc.metadata_value_id < Not1stOcc.metadata_value_id
)