Why am I getting a 'Column' object is not callable error in PySpark?


I am doing a simple parquet file read and running a query to find the unmatched rows from the left table. Please see the code snippet below.

argTestData = '<path to parquet file>'
tst_DF = spark.read.option('header', True).parquet(argTestData)


argrefData = '<path to parquet file>'
refDF = spark.read.option('header', True).parquet(argrefData)


cond = ["col1", "col2", "col3"]
fi = tst_DF.join(refDF, cond, "left_anti")

So far things are working. However, as a requirement, I need to get the list of elements if the above gives a count > 0, i.e. if the value of fi.count() > 0, then I need the element names. So I tried the code below, but it is throwing an error.

if fi.filter(col("col1").count() > 0).collect():
    fi.show()

Error:

TypeError: 'Column' object is not callable

Note:

  • I have 3 columns as the joining condition, which are in a list assigned to the variable cond, and I need to get the unmatched records for those 3 columns, so the if condition has to accommodate them. Of course, there are many other columns due to the join.

Please suggest where I am making a mistake. Thank you.

CodePudding user response:

If I understand correctly, that's simply:

fi.select(cond).collect()

The left_anti join already returns the records which do not match (those that exist in tst_DF but not in refDF).
You can add a distinct before the collect to remove duplicates, as in the sketch below.
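
A minimal sketch putting the count check and the collect together, assuming fi and cond are defined exactly as in the question:

if fi.count() > 0:
    # keep only the join-condition columns and drop duplicate key tuples
    unmatched = fi.select(cond).distinct().collect()
    for row in unmatched:
        print(row["col1"], row["col2"], row["col3"])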

CodePudding user response:

Did you import the col function?

from pyspark.sql import functions as F
...
# count() is a DataFrame method, not a Column method, so close filter() before counting
if fi.filter(F.col("col1").isNotNull()).count() > 0:
    fi.show()
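
Note that the import alone would not clear this particular TypeError: accessing .count on a Column is resolved by PySpark as a nested-field reference and returns another Column, and calling that Column object is what raises the error. If the goal is simply to show the rows whenever the anti-join result is non-empty, a minimal sketch using fi from the question:

if fi.count() > 0:  # count() is called on the DataFrame, not on a Column
    fi.show()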