I want to compare and get number of corresposndeces between two arrays in PostgresSQL.
For example: given the table
id | array |
---|---|
1 | a, b, d |
2 | a, e |
3 | a, b, d, e |
4 | b, c, d |
I would like to obtain a comparison table:
id_1 | id_2 | number_of_matches |
---|---|---|
1 | 2 | 1 |
1 | 3 | 3 |
1 | 4 | 2 |
2 | 3 | 2 |
2 | 4 | 0 |
3 | 4 | 2 |
There is no repeated values in each array, and the order of items inside arrays does not matter.
I found here about the overlap operator (&&) but it only returns a boolean if there is or not any correspondence between two arrays, but not the number of correspondences.
CREATE TABLE data (id integer, val text[]);
INSERT INTO data VALUES (1, '{a,b,d}');
INSERT INTO data VALUES (2, '{a,e}');
INSERT INTO data VALUES (3, '{a,b,d,e}');
INSERT INTO data VALUES (4, '{b,c,d}');
SELECT a.id id_1, b.id id_2, a.val && b.val match
FROM data a, data b
WHERE a.id < b.id;
returns
id_1 | id_2 | match |
---|---|---|
1 | 2 | t |
1 | 3 | t |
1 | 4 | t |
2 | 3 | t |
2 | 4 | f |
3 | 4 | t |
CodePudding user response:
WITH normalized AS ( -- 2
SELECT
id,
unnest(val) as array_element -- 1
FROM data
)
SELECT
n1.id,
n2.id,
COUNT(*)
FROM normalized n1
JOIN normalized n2 -- 3
ON (n1.array_element = n2.array_element AND n1.id <> n2.id)
GROUP BY n1.id, n2.id -- 4
ORDER BY n1.id, n2.id
- Normalize your data (note: You should think about normalizing your data directly in the table; means: no arrays) by creating one row per array element
- Use a CTE to "store" your normalized result
- Self-join your normalized result (but not the same
id
s) at thearray_element
GROUP
andCOUNT
CodePudding user response:
https://www.db-fiddle.com/f/sobBPcfXY7h8j7y6tKXfDj/0
get the intersection of the 2 arrays an count the elements of it:
SELECT
data.id,
data2.id as id2,
data.val,
data2.val val2,
ARRAY( SELECT * FROM UNNEST( data.val ) WHERE UNNEST = ANY( data2.val )) as intersection,
cardinality(ARRAY( SELECT * FROM UNNEST( data.val ) WHERE UNNEST = ANY( data2.val ))) as number_of_matches
from data,data as data2
where data.id <> data2.id