Consider this MWE:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('A', 5, 0), ('A', 6, 0), ('B', 3, 0)], ['id', 'value', 'currentVersion'])
+---+-----+--------------+
| id|value|currentVersion|
+---+-----+--------------+
|  A|    5|             0|
|  A|    6|             0|
|  B|    3|             0|
+---+-----+--------------+
With this expected output (the row holding each group's maximum value gets currentVersion = 1):
# +---+-----+--------------+
# | id|value|currentVersion|
# +---+-----+--------------+
# |  A|    5|             0|
# |  A|    6|             1|
# |  B|    3|             1|
# +---+-----+--------------+
How can I get to the expected output while relying on groupby?
The following works well for my other purposes, but fails here because I need to incorporate a groupby:
from pyspark.sql import functions as F
from pyspark.sql.functions import when

valueWhenTrue = 1
valueWhenFalse = 0

# Compares every row against a single global max, not the max per group
df = df.withColumn(
    "currentVersion",
    when(
        F.col("TimeStamp") == df.agg({"TimeStamp": "max"}).collect()[0][0],
        valueWhenTrue
    ).otherwise(valueWhenFalse)
)
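To see why this falls short, here is a plain-Python sketch (hypothetical data, not Spark) of the same comparison: a single global maximum cannot flag the maximum of each group.

```python
# Plain-Python sketch (not Spark) of the flawed logic above:
# comparing every row against one global max misses per-group maxima.
rows = [("A", 5), ("A", 6), ("B", 3)]

global_max = max(value for _, value in rows)  # 6
flags = [1 if value == global_max else 0 for _, value in rows]

print(flags)  # only ("A", 6) is flagged; ("B", 3) is B's max but gets 0
```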
CodePudding user response:
Found an answer that works for me:
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import when

# groupby -- find the max time within each ticket's partition
window_var = Window().partitionBy('TicketNumber')
df = df.withColumn('maxModified', F.max('Modified').over(window_var))

# case when: flag the row(s) holding the per-partition max
valueWhenTrue = 1
valueWhenFalse = 0
df = df.withColumn(
    "currentVersion",
    when(
        F.col("maxModified") == F.col('Modified'),
        valueWhenTrue
    ).otherwise(valueWhenFalse)
)
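The window logic can be mirrored in plain Python to check the result against the MWE data. This is a sketch under the assumption that the row(s) equal to each group's maximum should be flagged, exactly as `F.max(...).over(partitionBy(...))` does:

```python
# Plain-Python mirror (not Spark) of the window approach:
# compute each group's max, then flag rows equal to their group's max.
rows = [("A", 5), ("A", 6), ("B", 3)]

group_max = {}
for gid, value in rows:
    group_max[gid] = max(group_max.get(gid, value), value)

flagged = [(gid, value, 1 if value == group_max[gid] else 0)
           for gid, value in rows]

print(flagged)  # [('A', 5, 0), ('A', 6, 1), ('B', 3, 1)]
```

Note that, like the window version, this flags ties as well: if two rows in a partition share the maximum, both receive 1.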