I have the dataframe below:
col1 | col2 | col3 |
---|---|---|
Device1 | A | true |
Device1 | A | false |
Device1 | C | false |
Device1 | B | false |
I want to keep the first two rows (where col2 = A); A is identified because col3 is 'true' in row 1. In other words, for Device1 I want to keep all rows whose col2 value has at least one 'true' in col3. I expect the result below after the filter:
col1 | col2 | col3 |
---|---|---|
Device1 | A | true |
Device1 | A | false |
CodePudding user response:
You can do it with window functions: partition by col1 and col2, order by col3 descending (so true sorts first), and take the first col3 value over the window as a keep flag.
from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [('Device1', 'A', True),
     ('Device1', 'A', False),
     ('Device1', 'C', None),
     ('Device1', 'B', False)],
    ['col1', 'col2', 'col3'])

# Within each (col1, col2) group, sort col3 descending so any true comes first
w = W.partitionBy('col1', 'col2').orderBy(F.desc('col3'))

# 'keep' is the first col3 value in the ordered group: true if the group has at least one true
df = df.withColumn('keep', F.first('col3').over(w))

# Keep only rows whose group was flagged true, then drop the helper column
df = df.filter('keep').drop('keep')
df.show()
# +-------+----+-----+
# |   col1|col2| col3|
# +-------+----+-----+
# |Device1|   A| true|
# |Device1|   A|false|
# +-------+----+-----+
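
As an aside, here is a minimal alternative sketch (not part of the original answer) that avoids relying on the ordering: take max('col3') over an unordered window per (col1, col2) group, since the max of a boolean column is true as soon as one row in the group is true.

from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [('Device1', 'A', True),
     ('Device1', 'A', False),
     ('Device1', 'C', None),
     ('Device1', 'B', False)],
    ['col1', 'col2', 'col3'])

# Unordered window: max of the boolean col3 per (col1, col2) group flags groups containing a true
w = W.partitionBy('col1', 'col2')
df = df.withColumn('keep', F.max('col3').over(w)).filter('keep').drop('keep')
df.show()

Either way, rows whose group never has a true (including groups where col3 is only false or null) are filtered out.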