I have a table with groups that can be ordered by time.
Group | Time | Food
-------------------------
Fruits | 1 | Apples
Fruits | 3 | Ketchup
Fruits | 5 | Bananas
Veggies | 2 | Broccoli
Veggies | 4 | Peas
Veggies | 8 | Carrots
As part of a more complicated when().otherwise()
clause inside of withColumn()
I need to assign a value into that new column for the last row of each group. I suspect I should use row number so I have something like this:
windowSpec = Window.partitionBy("Group").orderBy("Time")
my_table \
.withColumn("group_row", F.row_number().over(windowSpec)) \
.withColumn("is_window_last",
F.when(F.max("group_row").over(windowSpec) == F.col("group_row"), "Last")
.otherwise("Not Last")) \
.show()
I would expect the result to be
Group | Time | Food. | group_row | is_window_last
-------------------------------------------
Fruits | 1 | Apples | 1 | Not Last
Fruits | 3 | Ketchup | 2 | Not Last
Fruits | 5 | Bananas | 3 | Last
Veggies | 2 | Broccoli | 1 | Not Last
Veggies | 4 | Peas | 2 | Not Last
Veggies | 8 | Carrots | 3 | Last
But instead I get
Group | Time | Food. | group_row | is_window_last
-------------------------------------------
Fruits | 1 | Apples | 1 | Last
Fruits | 3 | Ketchup | 2 | Last
Fruits | 5 | Bananas | 3 | Last
Veggies | 2 | Broccoli | 1 | Last
Veggies | 4 | Peas | 2 | Last
Veggies | 8 | Carrots | 3 | Last
I've tried
my_table \
.withColumn("group_row", F.row_number().over(windowSpec)) \
.withColumn("is_window_last",
F.when(F.max("group_row").over(windowSpec) == F.col("group_row").over(windowSpec), "Last")
.otherwise("Not Last")) \
.show()
and
my_table \
.withColumn("group_row", F.row_number().over(windowSpec)) \
.withColumn("is_window_last",
F.when((F.max("group_row") == F.col("group_row")).over(windowSpec), "Last")
.otherwise("Not Last")) \
.show()
but neither did what I expected.
CodePudding user response:
That's because you use the same windowSpec
on both window function
windowSpec = W.partitionBy("g").orderBy("t")
(df
.withColumn('group_row', F.row_number().over(W.partitionBy('g').orderBy('t')))
.withColumn('max_group', F.max('group_row').over(W.partitionBy('g'))) # DON'T order by 'time' here
.withColumn('is_window_last', F
.when(F.col('max_group') == F.col('group_row'), 'Last')
.otherwise('Not Last')
)
.show()
)
------- --- -------- --------- --------- --------------
| g| t| f|group_row|max_group|is_window_last|
------- --- -------- --------- --------- --------------
| Fruits| 1| Apples| 1| 3| Not Last|
| Fruits| 3| Ketchup| 2| 3| Not Last|
| Fruits| 5| Bananas| 3| 3| Last|
|Veggies| 2|Broccoli| 1| 3| Not Last|
|Veggies| 4| Peas| 2| 3| Not Last|
|Veggies| 8| Carrots| 3| 3| Last|
------- --- -------- --------- --------- --------------
This F.max("group_row").over(windowSpec)
will calculate max of group_row
per row. If you remove orderBy(time)
on that window function, you will get max of group_row
per group
CodePudding user response:
I would use Last, row number will incur more compute effort. You will be required to compute max row number to achieve your objective.
w=Window.partitionBy('Group').orderBy('Time').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('c',last('Food').over(w)).withColumn('c', when(col('Food')==col('c'),'Last').otherwise('Not Last')).show()
------- ---- -------- --------
| Group|Time| Food| c|
------- ---- -------- --------
| Fruits| 1| Apples|Not Last|
| Fruits| 3| Ketchup|Not Last|
| Fruits| 5| Bananas| Last|
|Veggies| 2|Broccoli|Not Last|
|Veggies| 4| Peas|Not Last|
|Veggies| 8| Carrots| Last|
------- ---- -------- --------