Home > Software design >  How to delete duplicate entries from a specific field by SQL query in DSpace?
How to delete duplicate entries from a specific field by SQL query in DSpace?

Time:05-05

I am using DSpace version 6.3 here. I discovered that I have created duplicate entries when performing a batch import. Using the SQL query from this answer, I managed to list all the duplicates in a given field. For this example, I am using the dc.subject field (metadata_field_id=57) to list items (dspace_object_id) that have duplicate values in dc.subject field.

Below is the query that I used:

SELECT metadata_value_id,
       dspace_object_id,       
       text_value
FROM   (SELECT *,
       COUNT(*) OVER (PARTITION BY dspace_object_id, text_value) AS cnt
FROM   metadatavalue where metadata_field_id=57) e
WHERE  cnt > 1

Below is the sample list generated from that query:

metadata_value_id dspace_object_id text_value
503018 13f07109-7797-4d5b-a8bd-1f9e91a2433d pompanos
503021 13f07109-7797-4d5b-a8bd-1f9e91a2433d pompanos
503217 233d1e67-b698-4e90-8e70-a175776b2d80 pests
503219 233d1e67-b698-4e90-8e70-a175776b2d80 pests
83574 47753988-fc2a-4416-b20d-acbff6e256de Penaeus monodon
10800 47753988-fc2a-4416-b20d-acbff6e256de Penaeus monodon
531520 50965923-bc65-4fdf-af61-2c8debdfe057 Penaeus monodon
531521 50965923-bc65-4fdf-af61-2c8debdfe057 Penaeus monodon
483882 57d0544c-1825-431a-acf9-eb835c24920b development
483879 57d0544c-1825-431a-acf9-eb835c24920b development

Based on the table above, you can see that there are 2 occurrences (others have more than 2) of a text_value in the same item (dspace_object_id).

My question is how can I delete the duplicate but retain the first occurrence? In the table above, rows with the following metadata_value_id should be deleted:

503021  
503219  
83574 <-- This should be deleted instead of 10800 because 10800 was put in (inserted) first.  
531521
483882 <-- Should be deleted instead of 483879 because of the same reason above.

I'm hoping to do this via SQL query alone because what I have done before is import this list to a Google spreadsheet, remove the duplicates there using a Remove Duplicates add-on, download it as a CSV, and then use that CSV to update the metadatavalue table.

CodePudding user response:

Assuming that the 'first occurrence' means the one with smaller metadata_value_id:

delete from metadatavalue as Not1stOcc
where exists (select * 
              from metadatavalue as FirstOcc 
              where FirstOcc.text_value = Not1stOcc.text_value
                and FirstOcc.metadata_value_id < Not1stOcc.metadata_value_id
            )

should work

CodePudding user response:

For my future reference, I modified tinazmu's answer that worked for my use case.

DELETE FROM metadatavalue as Not1stOcc
WHERE EXISTS (SELECT metadata_value_id, dspace_object_id, text_value, metadata_field_id
    FROM metadatavalue AS FirstOcc 
    WHERE FirstOcc.text_value = Not1stOcc.text_value
      AND FirstOcc.metadata_field_id = Not1stOcc.metadata_field_id
      AND FirstOcc.dspace_object_id = Not1stOcc.dspace_object_id
      AND FirstOcc.metadata_field_id = Not1stOcc.metadata_field_id
      AND metadata_field_id = 57
      AND FirstOcc.metadata_value_id < Not1stOcc.metadata_value_id
            )
  • Related