Home > Enterprise >  Convert a table of edge descriptions to a list of arrays containing trees
Convert a table of edge descriptions to a list of arrays containing trees

Time:02-22

I've got two tables in Snowflake, which define a bunch of graphs. They look like this:

Node Table:

node    family   <other descriptive columns>  
A       1
B       1
C       1
D       1
E       1
F       1
G       1
H       1
I       2
J       2

etc...

Edge Table:

to   from
A        B
B        C
B        D
D        E
F        G
G        H

etc...

A family describes a collection of one or more related graphs; some of these may be a single graph, others may be several small graphs etc. The graphs are unidirectional and may have several branches/forks. There is currently no label for each of these graphs (that is, it needs to be inferred from the connections).

I'd like to create a table with a list of graphs in it, such as:

family   id    structure
1        1     [A -> B -> C, B -> D -> E]
1        2     [F -> G -> H] 
2 ... etc...

Where each row represents a set of connected nodes and the structure of that set and id is a new label for the graph object.

I've made some progress using recursive CTEs but hit a wall. Any help highly appreciated!

CodePudding user response:

- Extract the two tables into your favorite programming language.
- Create a set of families
- Iterate F over family set
    - Iterate over rows in node table
         - If node belongs to F
              - Add node to vector
    - Construct empty adjacency matrix ( square 2D vector of size node vector size
    - Iterate N1 over node vector
          - Iterate N2 != N1 over node vector
                - IF N1-N2 in edge table
                       - Add to adjacency matrix
    - Iterate N1 over node vector
          - IF N1 has zero in-edges
               - Output tree rooted at N1
         

CodePudding user response:

Firstly we need to build the recursions.

WITH nodes(node, family) AS (
    SELECT * 
    FROM VALUES
        ('A',1),
        ('B',1),
        ('C',1),
        ('D',1),
        ('E',1),
        ('F',1),
        ('G',1),
        ('H',1),
        ('I',2),
        ('J',2)
), edges(_from, _to) AS (
    SELECT * 
    FROM VALUES
        ('A','B'),
        ('B','C'),
        ('B','D'),
        ('D','E'),
        ('F','G'),
        ('G','H')
), recursive_wrapper AS (
    WITH recur AS (
        -- Anchor Clause
        select
            family, 
            node, 
            node as tail, 
            array_construct(node) path, 
            1 as depth
        from nodes
        
        union all

        -- Recursive Clause
        select r.family,
            r.node,
            e._to as tail,
            ARRAY_APPEND(r.path, e._to) as path,
            r.depth   1 as depth 
        from recur AS r
        join edges AS e
            on r.tail = e._from
    )
    SELECT * 
    FROM recur
)
SELECT * 
FROM recursive_wrapper
WHERE node = 'A'

gives:

FAMILY NODE TAIL PATH DEPTH
1 A A [ "A" ] 1
1 A B [ "A", "B" ] 2
1 A C [ "A", "B", "C" ] 3
1 A D [ "A", "B", "D" ] 3
1 A E [ "A", "B", "D", "E" ] 4

Then we need to prune the sublimated rows (in above we want to remove depth 1,2,4 as they are sub parts of another row), we can do this by swapping from an array to a string for path

), recursive_wrapper AS (
    WITH recur AS (
        -- Anchor Clause
        select
            family, 
            node, 
            node as tail, 
            node::text path, /* this cast it to max size, verse the largest size of the input */
            1 as depth
        from nodes
        
        union all

        -- Recursive Clause
        select r.family,
            r.node,
            e._to as tail,
            r.path || '_' || e._to as path,
            r.depth   1 as depth 
        from recur AS r
        join edges AS e
            on r.tail = e._from
    )
    SELECT * 
    FROM recur
)
SELECT a.*
    ,b.path as b_path
    ,b.depth as b_depth
    ,startswith(b.path, a.path) as starts
    ,sum(iff(starts,1,0)) over(partition by a.family, a.node, a.path) as dominated_count
FROM recursive_wrapper as a
LEFT JOIN recursive_wrapper as b 
    ON a.family = b.family AND a.node = b.node AND a.depth < b.depth
WHERE a.node = 'A'
ORDER BY 1,2,4;

gives:

FAMILY NODE TAIL PATH DEPTH B_PATH B_DEPTH STARTS DOMINATED_COUNT
1 A A A 1 A_B 2 TRUE 4
1 A A A 1 A_B_C 3 TRUE 4
1 A A A 1 A_B_D 3 TRUE 4
1 A A A 1 A_B_D_E 4 TRUE 4
1 A B A_B 2 A_B_C 3 TRUE 3
1 A B A_B 2 A_B_D 3 TRUE 3
1 A B A_B 2 A_B_D_E 4 TRUE 3
1 A C A_B_C 3 A_B_D_E 4 FALSE 0
1 A D A_B_D 3 A_B_D_E 4 TRUE 1
1 A E A_B_D_E 4 0

so we want to just keep those with a count of zero, which we can do in a QUALIFY

SELECT a.*
    --,b.path as b_path
    --,b.depth as b_depth
    --,startswith(b.path, a.path) as starts
    --,sum(iff(starts,1,0)) over(partition by a.family, a.node, a.path) as dominated_count

FROM recursive_wrapper as a
LEFT JOIN recursive_wrapper as b 
    ON a.family = b.family AND a.node = b.node AND a.depth < b.depth
WHERE a.node = 'A'
QUALIFY sum(iff(startswith(b.path, a.path),1,0)) over(partition by a.family, a.node, a.path) = 0
ORDER BY 1,2,4;

gives:

FAMILY NODE TAIL PATH DEPTH
1 A C A_B_C 3
1 A E A_B_D_E 4

This will explode with loops, which we can protect against by put the array back, and checking for the value already being in the array

WITH nodes(node, family) AS (
    SELECT * 
    FROM VALUES
        ('A',1),
        ('B',1),
        ('C',1),
        ('D',1),
        ('E',1),
        ('F',1),
        ('G',1),
        ('H',1),
        ('I',2),
        ('J',2)
), edges(_from, _to) AS (
    SELECT * 
    FROM VALUES
        ('A','B'),
        ('B','C'),
        ('B','D'),
        ('D','E'),
        ('F','G'),
        ('G','H')
), recursive_wrapper AS (
    WITH recur AS (
        -- Anchor Clause
        select
            family, 
            node, 
            node as tail, 
            node::text path, /* this cast it to max size, verse the largest size of the input */
            array_construct(node) a_path, 
            1 as depth
        from nodes
        
        union all

        -- Recursive Clause
        select r.family,
            r.node,
            e._to as tail,
            r.path || '_' || e._to as path,
            ARRAY_APPEND(r.a_path, e._to) as a_path,
            r.depth   1 as depth 
        from recur AS r
        join edges AS e
            on r.tail = e._from AND ARRAY_CONTAINS( e._to::variant , r.a_path ) = FALSE
    )
    SELECT * 
    FROM recur
)
SELECT a.*
    --,b.path as b_path
    --,b.depth as b_depth
    --,startswith(b.path, a.path) as starts
    --,sum(iff(starts,1,0)) over(partition by a.family, a.node, a.path) as dominated_count

FROM recursive_wrapper as a
LEFT JOIN recursive_wrapper as b 
    ON a.family = b.family AND a.node = b.node AND a.depth < b.depth
WHERE a.node = 'A'
QUALIFY sum(iff(startswith(b.path, a.path),1,0)) over(partition by a.family, a.node, a.path) = 0
ORDER BY 1,2,4;
  • Related