Shared item in a node's property in Neo4j

I have a Neo4j database in which each node has a list property, call it variant_list, that contains strings. Here is an example of three nodes in JSON format:

[
{
  "identity": 1,
  "properties": {
    "variant_list": ["xx","yy", "zzz" ],
    "name": "First"
  }
},
{
  "identity": 2,
  "properties": {
    "variant_list": ["xx","pp", "ww" ],
    "name": "Second"
  }
},
{
  "identity": 3,
  "properties": {
    "variant_list": ["nn","pp", "ll" ],
    "name": "Third"
  }
}
]

I would like to write a Cypher query that returns the pairs (1,2) and (3,2), because nodes 1 and 2 share the string xx and nodes 2 and 3 share the string pp in their variant_list.

My database has 2 million nodes, so performance matters.

CodePudding user response:

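You don't have a choice but to do a Cartesian product of each node with every other node. Then check that a is not the same node as b and that any item in a.variant_list is also found in b.variant_list. With 2 million nodes that is on the order of 10^12 node pairs to compare, so the work has to be split into batches.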
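To make the pairing rule concrete, here is a minimal Python sketch of the same check, run on the three sample nodes from the question. It is only an in-memory model of the logic, not something Neo4j executes, and the names NODES and shared_pairs are made up for this illustration.

```python
from itertools import combinations

# The three sample nodes from the question, flattened for brevity.
NODES = [
    {"identity": 1, "name": "First", "variant_list": ["xx", "yy", "zzz"]},
    {"identity": 2, "name": "Second", "variant_list": ["xx", "pp", "ww"]},
    {"identity": 3, "name": "Third", "variant_list": ["nn", "pp", "ll"]},
]


def shared_pairs(nodes):
    """Return (name_a, name_b) for every pair of distinct nodes sharing an item."""
    ordered = sorted(nodes, key=lambda n: n["identity"])
    pairs = []
    # combinations() yields each unordered pair once, lower identity first,
    # so a node is never paired with itself and no pair is reported twice.
    for a, b in combinations(ordered, 2):
        # A non-empty intersection means at least one shared string.
        if set(a["variant_list"]) & set(b["variant_list"]):
            pairs.append((a["name"], b["name"]))
    return pairs


print(shared_pairs(NODES))
```

On the sample data this reports First/Second and Second/Third. The loop visits n(n-1)/2 pairs, which is why the cost grows quadratically with the number of nodes.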

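apoc.periodic.iterate takes the rows returned by the first statement (each node "a") in batches of 10,000 and runs the second statement for every row, matching it against node "b". However, iterate returns only batch statistics, not the rows produced by the second statement, so the result has to be written to the graph: either create a "Result" node or create a relationship connecting a to b. In my example below, I create a new Result node to store each pair.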

CALL apoc.periodic.iterate(
"MATCH (a:TestNode) RETURN a as row",
"WITH row
MATCH (b:TestNode) WHERE row < b
AND ANY(i IN row.variant_list WHERE i IN b.variant_list)
WITH row, b
CREATE (r:Result) SET r.pairs = [row.name, b.name]",
 {batchSize:10000, parallel:true, retries:0});

Result:
╒════════════════════════════╕
│"result"                    │
╞════════════════════════════╡
│{"pairs":["First","Second"]}│
├────────────────────────────┤
│{"pairs":["Second","Third"]}│
└────────────────────────────┘

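If you would rather create a relationship between the two nodes, replace the CREATE clause with the command below. In that case consider setting parallel:false, because concurrent batches that write relationships to the same nodes can run into lock contention or deadlocks.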

MERGE (row)-[:PAIRS_WITH]->(b)