Calculate cumulative sum and average based on column values in a Spark DataFrame


I have a Spark DataFrame:

Location    Month       Brand   Sector  TrueValue   PickoutValue
USA         1/1/2021    brand1  cars1   7418        30000       
USA         2/1/2021    brand1  cars1   1940        2000        
USA         3/1/2021    brand1  cars1   4692        2900        
USA         4/1/2021    brand1  cars1                           
USA         1/1/2021    brand2  cars2   16383104.2  16666667    
USA         2/1/2021    brand2  cars2   26812874.2  16666667    
USA         3/1/2021    brand2  cars2                           
USA         1/1/2021    brand3  cars3   75.6%       70.0%
USA         3/1/2021    brand3  cars3   73.1%       70.0%
USA         2/1/2021    brand3  cars3   77.1%       70.0%

I need to calculate the cumulative sum of TrueValue for brand1 and brand2, the cumulative average for brand3, and load those values into a TotalSumValue column.

My expected dataframe is:

# +--------+--------+------+------+----------+------------+-------------------+-------------+
# |Location|   Month| Brand|Sector| TrueValue|PickoutValue| month_in_timestamp|TotalSumValue|
# +--------+--------+------+------+----------+------------+-------------------+-------------+
# |     USA|1/1/2021|brand1| cars1|      7418|       30000|2021-01-01 00:00:00|       7418.0|
# |     USA|2/1/2021|brand1| cars1|      1940|        2000|2021-01-02 00:00:00|       9358.0|
# |     USA|3/1/2021|brand1| cars1|      4692|        2900|2021-01-03 00:00:00|      14050.0|
# |     USA|4/1/2021|brand1| cars1|      null|        null|2021-01-04 00:00:00|      14050.0|
# |     USA|1/1/2021|brand2| cars2|16383104.2|    16666667|2021-01-01 00:00:00|   16383104.2|
# |     USA|2/1/2021|brand2| cars2|26812874.2|    16666667|2021-01-02 00:00:00|   43195978.4|
# |     USA|3/1/2021|brand2| cars2|      null|        null|2021-01-03 00:00:00|   43195978.4|
# |     USA|1/1/2021|brand3| cars3|      75.6|        70.0|2021-01-01 00:00:00|         75.6|
# |     USA|2/1/2021|brand3| cars3|      77.1|        70.0|2021-01-02 00:00:00|         76.4|
# |     USA|3/1/2021|brand3| cars3|      73.1|        70.0|2021-01-03 00:00:00|         75.3|
# +--------+--------+------+------+----------+------------+-------------------+-------------+
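
The month_in_timestamp column in the expected output is not in the raw data; it is derived from Month, and brand3's percentage strings become plain numbers. A minimal sketch of that preprocessing (assuming the Month strings are day/month/year and that the '%' suffix can simply be stripped before casting):

from pyspark.sql import functions as F

# Assumed preprocessing, not shown in the raw data: parse Month as day/month/year
# and strip the '%' suffix so TrueValue can be aggregated numerically.
df = (df
      .withColumn('month_in_timestamp', F.to_timestamp('Month', 'd/M/yyyy'))
      .withColumn('TrueValue', F.regexp_replace('TrueValue', '%', '').cast('double')))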

I'm trying with this code, but I'm getting null for all the rows in the TotalSumValue column.

windowval=(Window.partitionBy('Location','Brand').orderBy('month_in_timestamp')
               .rangeBetween(Window.unboundedPreceding, 0))

df = df.withColumn('TotalSumValue',
                F.when(F.col('Brand').isin('brand1', 'brand2'), F.sum('TrueValue').over(windowval)),
                F.when(F.col('Brand').isin('brand3'), F.avg('TrueValue').over(windowval)))

CodePudding user response:

You need to chain the when() clauses, since you are populating a single column:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running window from the start of each (Location, Brand) partition up to the current row
windowval = (Window.partitionBy('Location', 'Brand').orderBy('month_in_timestamp')
             .rangeBetween(Window.unboundedPreceding, 0))

df = df.withColumn(
    'TotalSumValue',
    F.when(F.col('Brand').isin('brand1', 'brand2'), F.sum('TrueValue').over(windowval))
     .when(F.col('Brand').isin('brand3'), F.avg('TrueValue').over(windowval)))
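
A chained when() leaves TotalSumValue null for rows that match neither branch; an .otherwise(...) can be appended if a default is wanted. Assuming the preprocessing sketched above has already been applied, the result can be checked against the expected table with:

df.orderBy('Brand', 'month_in_timestamp').show()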