Home > Back-end >  Using withColumn, assign value to only the last row in a group in pyspark
Using withColumn, assign value to only the last row in a group in pyspark

Time:04-23

I have a table with groups that can be ordered by time.

Group   | Time | Food
-------------------------
Fruits  | 1    | Apples
Fruits  | 3    | Ketchup
Fruits  | 5    | Bananas
Veggies | 2    | Broccoli
Veggies | 4    | Peas
Veggies | 8    | Carrots

As part of a more complicated when().otherwise() clause inside of withColumn() I need to assign a value into that new column for the last row of each group. I suspect I should use row number so I have something like this:

windowSpec = Window.partitionBy("Group").orderBy("Time")

my_table \
.withColumn("group_row", F.row_number().over(windowSpec)) \
.withColumn("is_window_last", 
            F.when(F.max("group_row").over(windowSpec) == F.col("group_row"), "Last")
             .otherwise("Not Last")) \
.show()

I would expect the result to be

Group   | Time | Food.    | group_row | is_window_last
-------------------------------------------
Fruits  | 1    | Apples   | 1         | Not Last
Fruits  | 3    | Ketchup  | 2         | Not Last
Fruits  | 5    | Bananas  | 3         | Last
Veggies | 2    | Broccoli | 1         | Not Last
Veggies | 4    | Peas     | 2         | Not Last
Veggies | 8    | Carrots  | 3         | Last

But instead I get

Group   | Time | Food.    | group_row | is_window_last
-------------------------------------------
Fruits  | 1    | Apples   | 1         | Last
Fruits  | 3    | Ketchup  | 2         | Last
Fruits  | 5    | Bananas  | 3         | Last
Veggies | 2    | Broccoli | 1         | Last
Veggies | 4    | Peas     | 2         | Last
Veggies | 8    | Carrots  | 3         | Last

I've tried

my_table \
.withColumn("group_row", F.row_number().over(windowSpec)) \
.withColumn("is_window_last", 
            F.when(F.max("group_row").over(windowSpec) == F.col("group_row").over(windowSpec), "Last")
             .otherwise("Not Last")) \
.show()

and

my_table \
.withColumn("group_row", F.row_number().over(windowSpec)) \
.withColumn("is_window_last", 
            F.when((F.max("group_row") == F.col("group_row")).over(windowSpec), "Last")
             .otherwise("Not Last")) \
.show()

but neither did what I expected.

CodePudding user response:

That's because you use the same windowSpec on both window function

windowSpec = W.partitionBy("g").orderBy("t")

(df
    .withColumn('group_row', F.row_number().over(W.partitionBy('g').orderBy('t')))
    .withColumn('max_group', F.max('group_row').over(W.partitionBy('g'))) # DON'T order by 'time' here
    .withColumn('is_window_last', F
        .when(F.col('max_group') == F.col('group_row'), 'Last')
        .otherwise('Not Last')
    )
    .show()
)

 ------- --- -------- --------- --------- -------------- 
|      g|  t|       f|group_row|max_group|is_window_last|
 ------- --- -------- --------- --------- -------------- 
| Fruits|  1|  Apples|        1|        3|      Not Last|
| Fruits|  3| Ketchup|        2|        3|      Not Last|
| Fruits|  5| Bananas|        3|        3|          Last|
|Veggies|  2|Broccoli|        1|        3|      Not Last|
|Veggies|  4|    Peas|        2|        3|      Not Last|
|Veggies|  8| Carrots|        3|        3|          Last|
 ------- --- -------- --------- --------- -------------- 

This F.max("group_row").over(windowSpec) will calculate max of group_row per row. If you remove orderBy(time) on that window function, you will get max of group_row per group

CodePudding user response:

I would use Last, row number will incur more compute effort. You will be required to compute max row number to achieve your objective.

w=Window.partitionBy('Group').orderBy('Time').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)


df.withColumn('c',last('Food').over(w)).withColumn('c', when(col('Food')==col('c'),'Last').otherwise('Not Last')).show()

 ------- ---- -------- -------- 
|  Group|Time|    Food|       c|
 ------- ---- -------- -------- 
| Fruits|   1|  Apples|Not Last|
| Fruits|   3| Ketchup|Not Last|
| Fruits|   5| Bananas|    Last|
|Veggies|   2|Broccoli|Not Last|
|Veggies|   4|    Peas|Not Last|
|Veggies|   8| Carrots|    Last|
 ------- ---- -------- -------- 
  • Related