How to do this Pandas filtering in PySpark?


I want to keep all the rows of the groups whose values in some boolean column are all True (i.e. no aggregation). This is how I would do it in Pandas:

df.groupby('some_column').filter(lambda x: x['some_bool_column'].all())
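
For example, with a small DataFrame like the one below (made-up values, just to illustrate), only the rows of group 'a' should survive, because group 'b' contains a False:

import pandas as pd

df = pd.DataFrame({'some_column':      ['a', 'a', 'b', 'b'],
                   'some_bool_column': [True, True, True, False]})

# keeps the two rows of group 'a'; both rows of group 'b' are dropped
df.groupby('some_column').filter(lambda x: x['some_bool_column'].all())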

But how do I do the same thing in PySpark?

CodePudding user response:

Use forall (available since Spark 3.1): collect each group's boolean values into an array with collect_list over a window, then keep only the rows whose array contains nothing but True:

from pyspark.sql import Window
from pyspark.sql.functions import collect_list, forall

w = Window.partitionBy('some_column')

# keep a row only if every value of some_bool_column in its group is True
df.withColumn('group', collect_list('some_bool_column').over(w)) \
    .where(forall('group', lambda x: x)) \
    .drop('group').show()
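
If you'd rather not build the intermediate array, here is a sketch of the same idea with a plain window aggregate (casting the boolean to int so the group-wise minimum can be compared against 1):

from pyspark.sql import Window
from pyspark.sql.functions import col, min as min_

w = Window.partitionBy('some_column')

# the group-wise minimum of the casted booleans is 1 only when every value is True
df.withColumn('all_true', min_(col('some_bool_column').cast('int')).over(w) == 1) \
    .where(col('all_true')) \
    .drop('all_true').show()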

CodePudding user response:

I don't know the exact equivalent, but here is a workaround that does the same thing: group by 'some_column' and find the minimum of the bool column. If all of the bool values in a group are True, that minimum will be 1.

  1. Cast the bool column to long type.

  2. Find the minimum of the casted column per group.

  3. Keep only the values of some_column whose min_bool_value == 1.

     from pyspark.sql.functions import min, col

     df.groupBy('some_column') \
         .agg(min(col('some_bool_column').cast('long')).alias('min_bool_value')) \
         .where(col('min_bool_value') == 1) \
         .show()
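
To get back all the rows of those groups (which is what the Pandas filter returns), one option is to semi-join this result to the original DataFrame; a rough sketch:

     from pyspark.sql.functions import min, col

     good_groups = df.groupBy('some_column') \
         .agg(min(col('some_bool_column').cast('long')).alias('min_bool_value')) \
         .where(col('min_bool_value') == 1) \
         .select('some_column')

     # left_semi keeps every original row whose some_column appears in good_groups
     df.join(good_groups, on='some_column', how='left_semi').show()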
    