I'm trying to find everything that is in nodes2 but not in nodes1.
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 WHERE NOT IN
(SELECT * FROM nodes1))
""").show()
Getting the following error:
"cannot resolve 'NOT' given input columns: [nodes2.~id, nodes2.~label];"
Is it possible to do this sort of set difference operation in Pyspark?
CodePudding user response:
Matching single column with NOT IN:
You need to specify which column the NOT IN should match against in the WHERE clause; NOT IN cannot compare whole rows directly.
For example, if you want to check id:
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 WHERE id NOT IN
(SELECT id FROM nodes1))
""").show()
Matching multiple columns (or complete row) with NOT IN:
Or, if you really want to match the complete row (all columns), use something like CONCAT on all columns and compare the concatenated values:
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 WHERE CONCAT(id, label) NOT IN (SELECT CONCAT(id, label) FROM nodes1))
""").show()
Or with table aliases:
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 n2 WHERE CONCAT(n2.id, n2.label) NOT IN (SELECT CONCAT(n1.id, n1.label) FROM nodes1 n1))
""").show()