PySpark: Case When Groupby

Consider this MWE:

df = spark.createDataFrame([('A', 5, 0),('A',6, 0),('B',3, 0)], ['id', 'value', 'currentVersion'])
+---+-----+--------------+
| id|value|currentVersion|
+---+-----+--------------+
|  A|    5|             0|
|  A|    6|             0|
|  B|    3|             0|
+---+-----+--------------+

With this expected output:

# +---+-----+--------------+
# | id|value|currentVersion|
# +---+-----+--------------+
# |  A|    5|             0|
# |  A|    6|             1|
# |  B|    3|             1|
# +---+-----+--------------+

How can I get to the expected output while relying on groupby?

This works well for my other purposes, but fails here because I need to incorporate a groupby (note that this code runs against my real data, where the relevant column is TimeStamp):

from pyspark.sql import functions as F

valueWhenTrue = 1
valueWhenFalse = 0

df = df.withColumn(
    "currentVersion",
    F.when(
        # collect()[0][0] pulls a single global max, not a max per group
        F.col("TimeStamp") == df.agg({"TimeStamp": "max"}).collect()[0][0],
        valueWhenTrue
    ).otherwise(valueWhenFalse)
)
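
For reference, a literal groupby route would aggregate the per-group maximum and join it back onto the original rows. A minimal sketch against the MWE's id and value columns (the column names here follow the MWE rather than the real data, and the assumption is that the flag should mark each group's maximum value):

from pyspark.sql import functions as F

# Aggregate the max value per id, then join it back onto the original rows
max_df = df.groupBy('id').agg(F.max('value').alias('maxValue'))

df = (df.join(max_df, on='id')
        .withColumn('currentVersion',
                    F.when(F.col('value') == F.col('maxValue'), 1)
                     .otherwise(0))
        .drop('maxValue'))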

CodePudding user response:

Found an answer that works for me:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# groupby -- find the max Modified time within each TicketNumber partition
window_var = Window.partitionBy('TicketNumber')
df = df.withColumn('maxModified', F.max('Modified').over(window_var))


# case when -- flag the row whose Modified time equals its group's max
valueWhenTrue = 1
valueWhenFalse = 0

df = df.withColumn(
    "currentVersion",
    F.when(
        F.col("maxModified") == F.col("Modified"),
        valueWhenTrue
    ).otherwise(valueWhenFalse)
)
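
Translated back to the MWE (partitioning by id and flagging each group's maximum value), the same pattern collapses into a single withColumn. A minimal sketch, with the MWE's column names swapped in as an assumption:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Flag each row whose value equals the max value within its id group
df = df.withColumn(
    'currentVersion',
    F.when(F.col('value') == F.max('value').over(Window.partitionBy('id')), 1)
     .otherwise(0)
)

df.show()
# +---+-----+--------------+
# | id|value|currentVersion|
# +---+-----+--------------+
# |  A|    5|             0|
# |  A|    6|             1|
# |  B|    3|             1|
# +---+-----+--------------+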