I've got two tables in Snowflake, which define a bunch of graphs. They look like this:
Node Table:
node family <other descriptive columns>
A 1
B 1
C 1
D 1
E 1
F 1
G 1
H 1
I 2
J 2
etc...
Edge Table:
to from
A B
B C
B D
D E
F G
G H
etc...
A family describes a collection of one or more related graphs; some of these may be a single graph, others may be several small graphs etc. The graphs are unidirectional and may have several branches/forks. There is currently no label for each of these graphs (that is, it needs to be inferred from the connections).
I'd like to create a table with a list of graphs in it, such as:
family id structure
1 1 [A -> B -> C, B -> D -> E]
1 2 [F -> G -> H]
2 ... etc...
Where each row represents a set of connected nodes and the structure of that set and id is a new label for the graph object.
I've made some progress using recursive CTEs but hit a wall. Any help highly appreciated!
CodePudding user response:
- Extract the two tables into your favorite programming language.
- Create a set of families
- Iterate F over family set
- Iterate over rows in node table
- If node belongs to F
- Add node to vector
- Construct empty adjacency matrix ( square 2D vector of size node vector size
- Iterate N1 over node vector
- Iterate N2 != N1 over node vector
- IF N1-N2 in edge table
- Add to adjacency matrix
- Iterate N1 over node vector
- IF N1 has zero in-edges
- Output tree rooted at N1
CodePudding user response:
Firstly we need to build the recursions.
WITH nodes(node, family) AS (
SELECT *
FROM VALUES
('A',1),
('B',1),
('C',1),
('D',1),
('E',1),
('F',1),
('G',1),
('H',1),
('I',2),
('J',2)
), edges(_from, _to) AS (
SELECT *
FROM VALUES
('A','B'),
('B','C'),
('B','D'),
('D','E'),
('F','G'),
('G','H')
), recursive_wrapper AS (
WITH recur AS (
-- Anchor Clause
select
family,
node,
node as tail,
array_construct(node) path,
1 as depth
from nodes
union all
-- Recursive Clause
select r.family,
r.node,
e._to as tail,
ARRAY_APPEND(r.path, e._to) as path,
r.depth 1 as depth
from recur AS r
join edges AS e
on r.tail = e._from
)
SELECT *
FROM recur
)
SELECT *
FROM recursive_wrapper
WHERE node = 'A'
gives:
FAMILY | NODE | TAIL | PATH | DEPTH |
---|---|---|---|---|
1 | A | A | [ "A" ] | 1 |
1 | A | B | [ "A", "B" ] | 2 |
1 | A | C | [ "A", "B", "C" ] | 3 |
1 | A | D | [ "A", "B", "D" ] | 3 |
1 | A | E | [ "A", "B", "D", "E" ] | 4 |
Then we need to prune the sublimated rows (in above we want to remove depth 1,2,4 as they are sub parts of another row), we can do this by swapping from an array to a string for path
), recursive_wrapper AS (
WITH recur AS (
-- Anchor Clause
select
family,
node,
node as tail,
node::text path, /* this cast it to max size, verse the largest size of the input */
1 as depth
from nodes
union all
-- Recursive Clause
select r.family,
r.node,
e._to as tail,
r.path || '_' || e._to as path,
r.depth 1 as depth
from recur AS r
join edges AS e
on r.tail = e._from
)
SELECT *
FROM recur
)
SELECT a.*
,b.path as b_path
,b.depth as b_depth
,startswith(b.path, a.path) as starts
,sum(iff(starts,1,0)) over(partition by a.family, a.node, a.path) as dominated_count
FROM recursive_wrapper as a
LEFT JOIN recursive_wrapper as b
ON a.family = b.family AND a.node = b.node AND a.depth < b.depth
WHERE a.node = 'A'
ORDER BY 1,2,4;
gives:
FAMILY | NODE | TAIL | PATH | DEPTH | B_PATH | B_DEPTH | STARTS | DOMINATED_COUNT |
---|---|---|---|---|---|---|---|---|
1 | A | A | A | 1 | A_B | 2 | TRUE | 4 |
1 | A | A | A | 1 | A_B_C | 3 | TRUE | 4 |
1 | A | A | A | 1 | A_B_D | 3 | TRUE | 4 |
1 | A | A | A | 1 | A_B_D_E | 4 | TRUE | 4 |
1 | A | B | A_B | 2 | A_B_C | 3 | TRUE | 3 |
1 | A | B | A_B | 2 | A_B_D | 3 | TRUE | 3 |
1 | A | B | A_B | 2 | A_B_D_E | 4 | TRUE | 3 |
1 | A | C | A_B_C | 3 | A_B_D_E | 4 | FALSE | 0 |
1 | A | D | A_B_D | 3 | A_B_D_E | 4 | TRUE | 1 |
1 | A | E | A_B_D_E | 4 | 0 |
so we want to just keep those with a count of zero, which we can do in a QUALIFY
SELECT a.*
--,b.path as b_path
--,b.depth as b_depth
--,startswith(b.path, a.path) as starts
--,sum(iff(starts,1,0)) over(partition by a.family, a.node, a.path) as dominated_count
FROM recursive_wrapper as a
LEFT JOIN recursive_wrapper as b
ON a.family = b.family AND a.node = b.node AND a.depth < b.depth
WHERE a.node = 'A'
QUALIFY sum(iff(startswith(b.path, a.path),1,0)) over(partition by a.family, a.node, a.path) = 0
ORDER BY 1,2,4;
gives:
FAMILY | NODE | TAIL | PATH | DEPTH |
---|---|---|---|---|
1 | A | C | A_B_C | 3 |
1 | A | E | A_B_D_E | 4 |
This will explode with loops, which we can protect against by put the array back, and checking for the value already being in the array
WITH nodes(node, family) AS (
SELECT *
FROM VALUES
('A',1),
('B',1),
('C',1),
('D',1),
('E',1),
('F',1),
('G',1),
('H',1),
('I',2),
('J',2)
), edges(_from, _to) AS (
SELECT *
FROM VALUES
('A','B'),
('B','C'),
('B','D'),
('D','E'),
('F','G'),
('G','H')
), recursive_wrapper AS (
WITH recur AS (
-- Anchor Clause
select
family,
node,
node as tail,
node::text path, /* this cast it to max size, verse the largest size of the input */
array_construct(node) a_path,
1 as depth
from nodes
union all
-- Recursive Clause
select r.family,
r.node,
e._to as tail,
r.path || '_' || e._to as path,
ARRAY_APPEND(r.a_path, e._to) as a_path,
r.depth 1 as depth
from recur AS r
join edges AS e
on r.tail = e._from AND ARRAY_CONTAINS( e._to::variant , r.a_path ) = FALSE
)
SELECT *
FROM recur
)
SELECT a.*
--,b.path as b_path
--,b.depth as b_depth
--,startswith(b.path, a.path) as starts
--,sum(iff(starts,1,0)) over(partition by a.family, a.node, a.path) as dominated_count
FROM recursive_wrapper as a
LEFT JOIN recursive_wrapper as b
ON a.family = b.family AND a.node = b.node AND a.depth < b.depth
WHERE a.node = 'A'
QUALIFY sum(iff(startswith(b.path, a.path),1,0)) over(partition by a.family, a.node, a.path) = 0
ORDER BY 1,2,4;