Keep rows with one particular value in one column using some condition in another column in the same dataframe

I have the following dataframe:

col1     col2  col3
Device1  A     true
Device1  A     false
Device1  C     false
Device1  B     false

I want to keep the first two rows (where col2 = A); group A qualifies because its col3 is 'true' in row 1. In other words, for Device1, I want to keep every row whose col2 group has at least one 'true' in col3. I expect the result below after the filter:

col1     col2  col3
Device1  A     true
Device1  A     false

CodePudding user response:

You can do it with window functions: partition by col1 and col2, order by col3 descending, and take the first value of col3 over the window. Because true sorts before false in descending order, that first value is true exactly when the group contains at least one true row, which is the flag you need for the filter.

from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [('Device1', 'A', True),
     ('Device1', 'A', False),
     ('Device1', 'C', False),
     ('Device1', 'B', False)],
    ['col1', 'col2', 'col3'])

# Within each (col1, col2) group, sort true before false...
w = W.partitionBy('col1', 'col2').orderBy(F.desc('col3'))
# ...so first('col3') is true iff the group has at least one true row.
df = df.withColumn('keep', F.first('col3').over(w))
# Keep only the flagged groups and drop the helper column.
df = df.filter('keep').drop('keep')

df.show()
#  ------- ---- ----- 
# |   col1|col2| col3|
#  ------- ---- ----- 
# |Device1|   A| true|
# |Device1|   A|false|
#  ------- ---- ----- 