PostgreSQL: Merging sets of rows which text fields are contained in other sets of rows-CodePudding

Given the following table, I need to merge the fields in different "id" only if they are the same type (person or dog), and always as the value of every field of an "id" is contained in the values of other "ids".

id	being	feature	values
1	person	name	John;Paul
1	person	surname	Smith
2	dog	name	Ringo
3	dog	name	Snowy
4	person	name	John
4	person	surname
5	person	name	John;Ringo
5	person	surname	Smith

In this example, the merge results should be as follows:

1 and 4 (Since 4's name is present in 1's name and 4's surname is empty) 1 and 5 cannot be merged (the name field show different values) 4 and 5 can be merged 2 and 3 (dogs) cannot be merged. They have only the field "name" and they do not share values. 2 and 3 cannot be merged with 1, 4, 5 since they have different values in "being".

id	being	feature	values
1	person	name	John;Paul
1	person	surname	Smith
2	dog	name	Ringo
3	dog	name	Snowy
5	person	name	John;Ringo
5	person	surname	Smith

I have tried this:

UPDATE table a
SET values = (SELECT array_to_string(array_agg(distinct values),';') AS values FROM table b
           WHERE a.being= b.being
           AND a.feature= b.feature
           AND a.id<> b.id
           AND a.values LIKE '%'||a.values||'%'        
          )
WHERE (select count (*) FROM (SELECT DISTINCT c.being, c.id from table c where  a.being=c.being) as temp) >1
;

This doesn't work well because it will merge, for example, 1 and 5. Besides, it duplicates values when merging that field.

Cloud you please help me? I'd appreciate your help very much!

CodePudding user response：

One option is to aggregate names with surnames on "id" and "being". Once you get a single string per "id", a self join may find when a full name is completely included inside another (where the "being" is same for both "id"s), then you just select the smallest fullname, candidate for deletion:

WITH cte AS (
    SELECT id, 
           being,
           STRING_AGG(values, ';') AS fullname
    FROM tab
    GROUP BY id, 
             being
)
DELETE FROM tab
WHERE id IN (SELECT t2.id
             FROM       cte t1
             INNER JOIN cte t2
                     ON t1.being = t2.being
                    AND t1.id > t2.id
                    AND t1.fullname LIKE CONCAT('%',t2.fullname,'%'));

Check the demo here.