I have a Neo4j database.
I have nodes that each have a list property; let's call it variant_list. (It contains a list of strings.)
Here is an example of 3 nodes in JSON format:
[
{
"identity": 1,
"properties": {
"variant_list": ["xx","yy", "zzz" ],
"name": "First"
}
},
{
"identity": 2,
"properties": {
"variant_list": ["xx","pp", "ww" ],
"name": "Second"
}
},
{
"identity": 3,
"properties": {
"variant_list": ["nn","pp", "ll" ],
"name": "Third"
}
}
]
I would like to write a Cypher query that returns the pairs (1,2) and (3,2), because nodes 1 and 2 share the string xx and nodes 2 and 3 share the string pp in their variant_list.
My database has 2 million nodes, so performance matters.
CodePudding user response:
You have no choice but to take the Cartesian product of every node with every other node, then check that a is not the same node as b and that at least one item in a.variant_list is also found in b.variant_list.
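On a small sample, the check described above can be tried as a plain query before batching it. This is only a minimal sketch: it assumes the nodes carry the TestNode label, and it uses id() to order each pair so that a node is never paired with itself and each pair appears once (on recent Neo4j versions id() is deprecated in favor of elementId()).

```cypher
// Cartesian product of TestNode with itself, filtered down to pairs
// whose variant_list properties share at least one string.
MATCH (a:TestNode), (b:TestNode)
WHERE id(a) < id(b)
  AND any(v IN a.variant_list WHERE v IN b.variant_list)
RETURN a.name AS first, b.name AS second;
```

For the three sample nodes this should give First/Second and Second/Third. On 2 million nodes a single transaction like this is likely to be too heavy, which is why the work is split into batches.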
apoc.periodic.iterate feeds the nodes "a" to the second statement in batches of 10k rows, and each row is then matched against every node "b". Because iterate returns only batch statistics, not the rows its second statement produces, the pairs have to be written to the graph: either create a "Result" node per pair or create a relationship connecting a to b. In my example below, I create a new Result node to store each pair.
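CALL apoc.periodic.iterate(
  // Outer statement: stream every node as "row".
  "MATCH (a:TestNode) RETURN a AS row",
  // Inner statement: runs per row; id(row) < id(b) skips self-pairs and mirrored duplicates.
  "WITH row
   MATCH (b:TestNode) WHERE id(row) < id(b)
   AND ANY(i IN row.variant_list WHERE i IN b.variant_list)
   WITH row, b
   CREATE (r:Result) SET r.pairs = [row.name, b.name]",
  {batchSize:10000, parallel:true, retries:0});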
Result:
╒════════════════════════════╕
│"result" │
╞════════════════════════════╡
│{"pairs":["First","Second"]}│
├────────────────────────────┤
│{"pairs":["Second","Third"]}│
└────────────────────────────┘
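Since the procedure call itself reports only batch statistics, the stored pairs have to be read back with a separate query. A minimal sketch, assuming the Result label and pairs property from the statement above:

```cypher
// Read back the pairs written by the batched statement.
MATCH (r:Result)
RETURN r.pairs AS pairs;
```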
If you would rather create a relationship between the two nodes, you can replace the CREATE clause with:
MERGE (row)-[:PAIRS_WITH]->(b)
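Put together, the relationship variant might look like the sketch below. It reuses the TestNode label and the PAIRS_WITH type from above and orders each pair with id(). Setting parallel to false is a cautious assumption on my part, not something the data requires: concurrent batches that write relationships onto the same nodes can contend for locks.

```cypher
CALL apoc.periodic.iterate(
  // Outer statement: stream every node as "row".
  "MATCH (a:TestNode) RETURN a AS row",
  // Inner statement: link row to every later node that shares a variant.
  "WITH row
   MATCH (b:TestNode)
   WHERE id(row) < id(b)
     AND any(i IN row.variant_list WHERE i IN b.variant_list)
   MERGE (row)-[:PAIRS_WITH]->(b)",
  {batchSize: 10000, parallel: false});
```

The pairs can then be listed with MATCH (a)-[:PAIRS_WITH]->(b) RETURN a.name, b.name.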