I am doing a simple parquet file reading and running a query to find the un-matched rows from left table. Please see the code snippet below.
argTestData = '<path to parquet file>'
tst_DF = spark.read.option('header', True).parquet(argTestData)
argrefData = '<path to parquet file>'
refDF = spark.read.option('header', True).parquet(argrefData)
cond = ["col1", "col2", "col3"]
fi = tst_DF.join(refDF, cond , "left_anti")
So far things are working. However, as a requirement, I need to get the elements list if the above gives count > 0, i.e. if the value of fi.count() > 0, then I need the elements name. So, I tried below code, but it is throwing error.
if fi.filter(col("col1").count() > 0).collect():
fi.show()
error
TypeError: 'Column' object is not callable
Note:
- I have 3 columns as a joining condition which is in a list and assigned to a variable
cond
, and I need to get the un-matched records for those 3 columns, so the if condition has to accommodate them. OfCourse there are many other columns due tojoin
.
Please suggest where am I making mistakes. Thank you
CodePudding user response:
If I understand correctly, that's simply :
fi.select(cond).collect()
The left_anti
already get the records which do not match (exists in tst_DF but not in refDF).
You can add a distinct
before the collect to remove duplicates.
CodePudding user response:
Did you import the column function?
from pyspark.sql import functions as F
...
if fi.filter(F.col("col1").count() > 0).collect():
fi.show()