How to do this Pandas filtering in PySpark?


I want to keep all the rows of the groups whose values in some boolean column are all True (i.e. no aggregation). This is how I would do it in Pandas:

df.groupby('some_column').filter(lambda x: x['some_bool_column'].all())
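
For example, with a small DataFrame like the one below (made-up values, just to illustrate), only the rows of group 'a' should survive, because group 'b' contains a False:

import pandas as pd

df = pd.DataFrame({'some_column':      ['a', 'a', 'b', 'b'],
                   'some_bool_column': [True, True, True, False]})

# keeps the two rows of group 'a'; both rows of group 'b' are dropped
df.groupby('some_column').filter(lambda x: x['some_bool_column'].all())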

But how do I do the same thing in PySpark?

CodePudding user response:

Use forall (available since Spark 3.1): collect each group's boolean values into an array with collect_list over a window, then keep only the rows whose array contains nothing but True:

from pyspark.sql import Window
from pyspark.sql.functions import collect_list, forall

w = Window.partitionBy('some_column')

# keep a row only if every value of some_bool_column in its group is True
df.withColumn('group', collect_list('some_bool_column').over(w)) \
    .where(forall('group', lambda x: x)) \
    .drop('group').show()
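
If you'd rather not build the intermediate array, here is a sketch of the same idea with a plain window aggregate (casting the boolean to int so the group-wise minimum can be compared against 1):

from pyspark.sql import Window
from pyspark.sql.functions import col, min as min_

w = Window.partitionBy('some_column')

# the group-wise minimum of the casted booleans is 1 only when every value is True
df.withColumn('all_true', min_(col('some_bool_column').cast('int')).over(w) == 1) \
    .where(col('all_true')) \
    .drop('all_true').show()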

CodePudding user response:

I don't know the exact equivalent, but here is a workaround that does the same thing: group by 'some_column' and find the minimum of the bool column. If all of the bool values in a group are True, that minimum will be 1.

  1. Cast the bool column to long type.

  2. Find the minimum of the casted column per group.

  3. Keep only the values of some_column whose min_bool_value == 1.

     from pyspark.sql.functions import min, col

     df.groupBy('some_column') \
         .agg(min(col('some_bool_column').cast('long')).alias('min_bool_value')) \
         .where(col('min_bool_value') == 1) \
         .show()
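
To get back all the rows of those groups (which is what the Pandas filter returns), one option is to semi-join this result to the original DataFrame; a rough sketch:

     from pyspark.sql.functions import min, col

     good_groups = df.groupBy('some_column') \
         .agg(min(col('some_bool_column').cast('long')).alias('min_bool_value')) \
         .where(col('min_bool_value') == 1) \
         .select('some_column')

     # left_semi keeps every original row whose some_column appears in good_groups
     df.join(good_groups, on='some_column', how='left_semi').show()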
    