I'm trying to do this kind of query:
SELECT age,COUNT(age)
FROM T
GROUP BY age
HAVING COUNT(age) = (SELECT MIN(cnt)
                     FROM (SELECT COUNT(age) AS cnt FROM T GROUP BY age))
ORDER BY COUNT(age)
I tried
min_size = df.groupBy("age").count().select(f.min("count"))
df.groupBy("age").count().sort("count").filter(f.col("count")==min_size).show()
but I get AttributeError: 'DataFrame' object has no attribute '_get_object_id'
Is there any way to use subqueries in PySpark?
CodePudding user response:
In your case, min_size is a DataFrame (one row, one column), not an integer, so comparing a column against it fails. Collect it into a Python integer like this:
min_size = df.groupBy("age").count().select(f.min("count")).collect()[0][0]
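The key point is that collect() turns the one-row, one-column result into a plain scalar you can compare against. To illustrate the same min-count filter without a Spark session, here is a minimal sketch using only the Python standard library; the ages list is made-up sample data:

    from collections import Counter

    # Toy stand-in for df.groupBy("age").count(): maps age -> row count
    ages = [20, 20, 21, 22, 22, 22]
    counts = Counter(ages)

    # Equivalent of select(f.min("count")).collect()[0][0]: a plain int
    min_size = min(counts.values())

    # Equivalent of .filter(f.col("count") == min_size).sort("count")
    result = sorted((age, c) for age, c in counts.items() if c == min_size)

Back in PySpark, df.groupBy("age").count().select(f.min("count")).first()[0] is an equivalent way to pull out the scalar, since first() returns a single Row.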