I want to keep all rows of the groups in which every value of some boolean column is True (i.e. no aggregation). This is how I would do it in Pandas:
df.groupby('some_column').filter(lambda x: x['some_bool_column'].all())
But how do I do the same thing in PySpark?
CodePudding user response:
Use forall (available since Spark 3.1):
from pyspark.sql import Window
from pyspark.sql.functions import collect_list, forall

w = Window.partitionBy('some_column')

df.withColumn('group', collect_list('some_bool_column').over(w)) \
  .where(forall('group', lambda x: x)) \
  .drop('group').show()
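As a quick sanity check, here is a self-contained sketch with a small made-up DataFrame (the toy data and the SparkSession setup are mine, not from the question). Group 'a' is all True and survives the filter; group 'b' contains a False and is dropped:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import collect_list, forall

spark = SparkSession.builder.getOrCreate()

# hypothetical toy data: group 'a' is all True, group 'b' has one False row
df = spark.createDataFrame(
    [('a', True), ('a', True), ('b', True), ('b', False)],
    ['some_column', 'some_bool_column'])

w = Window.partitionBy('some_column')
df.withColumn('group', collect_list('some_bool_column').over(w)) \
  .where(forall('group', lambda x: x)) \
  .drop('group').show()
# only the two rows of group 'a' remain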
CodePudding user response:
I don't know the exact equivalent, but here is a workaround that achieves the same thing. We can group by 'some_column' and take the minimum of the boolean column after casting it to a numeric type: if every value in a group is True, the minimum is 1.
1. Cast the bool column to long type.
2. Find the minimum per group.
3. Keep only the groups where min_bool_value == 1.
from pyspark.sql.functions import min, col

df.groupBy('some_column') \
  .agg(min(col('some_bool_column').cast('long')).alias('min_bool_value')) \
  .where(col('min_bool_value') == 1) \
  .show()
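Note that groupBy().agg() returns only one row per group, so to actually keep all of the original rows (as the pandas filter does) you can join the qualifying group keys back onto df. A minimal sketch of that extra step, assuming the same column names as above:

from pyspark.sql.functions import min, col

# groups whose minimum casted value is 1, i.e. every row is True
qualifying = df.groupBy('some_column') \
    .agg(min(col('some_bool_column').cast('long')).alias('min_bool_value')) \
    .where(col('min_bool_value') == 1) \
    .select('some_column')

# inner join keeps every original row of the groups where all values are True
df.join(qualifying, on='some_column', how='inner').show()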