PySpark: How to select rows from a DataFrame based on a condition

Time:10-03

I have this data frame. I would like to walk through the rows in order and, for each run of consecutive rows where Condition is 1, take the maximum Value. For example, the first run gives max(16, 8) = 16 and the second gives max(6, 10, 9, 8) = 10.

Identifiant Value Condition
ID1 16 1
ID2 8 1
ID3 4 0
ID4 5 0
ID5 6 1
ID6 10 1
ID7 9 1
ID8 8 1
ID9 9 0
ID10 11 0
ID11 6 1
ID12 8 1
ID13 10 0
ID13 12 1
ID14 15 0
ID15 14 1
ID16 8 1
ID17 9 1

We should obtain this table:

Identifiant Value
ID1   16   max(16, 8)
ID6   10   max(6, 10, 9, 8)
ID12   8   max(6, 8)
ID13  12   max(12)
ID15  14   max(14, 8, 9)
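To make the expected result concrete, the rule can be sketched in plain Python (outside Spark) with `itertools.groupby`, which groups consecutive rows by Condition; the row list below simply mirrors the question's table in order:

```python
from itertools import groupby

# Sample rows in table order: (Identifiant, Value, Condition)
rows = [
    ("ID1", 16, 1), ("ID2", 8, 1), ("ID3", 4, 0), ("ID4", 5, 0),
    ("ID5", 6, 1), ("ID6", 10, 1), ("ID7", 9, 1), ("ID8", 8, 1),
    ("ID9", 9, 0), ("ID10", 11, 0), ("ID11", 6, 1), ("ID12", 8, 1),
    ("ID13", 10, 0), ("ID13", 12, 1), ("ID14", 15, 0),
    ("ID15", 14, 1), ("ID16", 8, 1), ("ID17", 9, 1),
]

# Group consecutive rows by Condition; keep the max of each run of 1s
result = []
for cond, run in groupby(rows, key=lambda r: r[2]):
    if cond == 1:
        best = max(run, key=lambda r: r[1])
        result.append((best[0], best[1]))

print(result)
# [('ID1', 16), ('ID6', 10), ('ID12', 8), ('ID13', 12), ('ID15', 14)]
```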

CodePudding user response:

You can create separate windows for this. First, create a window to compute a helper column that assigns an id to each run of identical Condition values: the first run of 1s gets one id, the next run of 0s gets the next id, the following run of 1s the next, and so on. Once we have that, we can use the Condition column to filter out the zeroes, compute a descending rank over Value within each Helper partition, and keep rank 1 to get the max value per group:
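The helper-column trick (change flag plus running sum) can be illustrated outside Spark; this plain-Python sketch is not part of the answer, it just traces what `lag`, `fillna`, and the cumulative `sum` compute on the sample Condition column:

```python
# Condition column from the question's table, in row order
conditions = [1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1]

# 1 where Condition differs from the previous row, 0 otherwise
# (the first row has no predecessor, so its flag is 0, like fillna)
flags = [0] + [int(cur != prev) for cur, prev in zip(conditions[1:], conditions)]

# Running sum of the flags = one id per consecutive run
helper, total = [], 0
for f in flags:
    total += f
    helper.append(total)

print(helper)
# [0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8, 8, 8]
```

Rows sharing a Helper id form one consecutive run, so filtering Condition != 0 and taking the max per Helper partition yields exactly the expected table.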

from pyspark.sql import functions as F, Window as W

# Order rows by the numeric part of Identifiant: ordering by the raw string
# would place ID10 before ID2 and break the consecutive grouping.
row_order = F.substring("Identifiant", 3, 10).cast("int")

w = W.orderBy(row_order)
w1 = w.rangeBetween(W.unboundedPreceding, 0)
w2 = W.partitionBy("Helper").orderBy(F.desc("Value"))

# 1 where Condition differs from the previous row, null on the first row
c = (F.col("Condition") != F.lag("Condition").over(w)).cast("int")

out = (
    df.withColumn("Helper", c)
      .fillna({"Helper": 0})                           # first row starts group 0
      .withColumn("Helper", F.sum("Helper").over(w1))  # running sum = group id
      .filter("Condition != 0")                        # drop the runs of 0s
      .withColumn("Rnk", F.dense_rank().over(w2))      # rank by Value within each run
      .filter("Rnk == 1")                              # keep the max of each run
      .drop("Helper", "Rnk", "Condition")
)

out.show()

+-----------+-----+
|Identifiant|Value|
+-----------+-----+
|        ID1|   16|
|        ID6|   10|
|       ID12|    8|
|       ID13|   12|
|       ID15|   14|
+-----------+-----+