I have the dataframe below:
col1 | col2 | col3 |
---|---|---|
Device1 | A | true |
Device1 | A | false |
Device1 | C | false |
Device1 | B | false |
I want to keep the first two rows (where col2 = A); A is identified because col3 is 'true' in row 1. In other words, for Device1 I want to keep all rows whose col2 value has at least one 'true' in col3. I expect the result below after the filter:
col1 | col2 | col3 |
---|---|---|
Device1 | A | true |
Device1 | A | false |
CodePudding user response:
You can do it with window functions: partition by col1 and col2, order by col3 descending (so true sorts first), and take the first col3 value over the window as a keep flag.
from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [('Device1', 'A', True),
     ('Device1', 'A', False),
     ('Device1', 'C', None),
     ('Device1', 'B', False)],
    ['col1', 'col2', 'col3'])

# Within each (col1, col2) group, sort col3 descending so any true comes first
w = W.partitionBy('col1', 'col2').orderBy(F.desc('col3'))

# 'keep' is the first col3 value in the ordered group: true if the group has at least one true
df = df.withColumn('keep', F.first('col3').over(w))

# Keep only rows whose group was flagged true, then drop the helper column
df = df.filter('keep').drop('keep')
df.show()
# +-------+----+-----+
# |   col1|col2| col3|
# +-------+----+-----+
# |Device1|   A| true|
# |Device1|   A|false|
# +-------+----+-----+
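
As an aside, here is a minimal alternative sketch (not part of the original answer) that avoids relying on the ordering: take max('col3') over an unordered window per (col1, col2) group, since the max of a boolean column is true as soon as one row in the group is true.

from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [('Device1', 'A', True),
     ('Device1', 'A', False),
     ('Device1', 'C', None),
     ('Device1', 'B', False)],
    ['col1', 'col2', 'col3'])

# Unordered window: max of the boolean col3 per (col1, col2) group flags groups containing a true
w = W.partitionBy('col1', 'col2')
df = df.withColumn('keep', F.max('col3').over(w)).filter('keep').drop('keep')
df.show()

Either way, rows whose group never has a true (including groups where col3 is only false or null) are filtered out.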