I have a Spark DataFrame:
Location Month Brand Sector TrueValue PickoutValue
USA 1/1/2021 brand1 cars1 7418 30000
USA 2/1/2021 brand1 cars1 1940 2000
USA 3/1/2021 brand1 cars1 4692 2900
USA 4/1/2021 brand1 cars1
USA 1/1/2021 brand2 cars2 16383104.2 16666667
USA 2/1/2021 brand2 cars2 26812874.2 16666667
USA 3/1/2021 brand2 cars2
USA 1/1/2021 brand3 cars3 75.6% 70.0%
USA 3/1/2021 brand3 cars3 73.1% 70.0%
USA 2/1/2021 brand3 cars3 77.1% 70.0%
I need to calculate the cumulative sum for brand1 and brand2 and the cumulative average for brand3, and store those values in a TotalSumValue column.
My expected dataframe is:
# -------- -------- ------ ------ ---------- ------------ ------------------- -------------
# |Location| Month| Brand|Sector| TrueValue|PickoutValue| month_in_timestamp|TotalSumValue|
# -------- -------- ------ ------ ---------- ------------ ------------------- -------------
# | USA|1/1/2021|brand1| cars1| 7418| 30000|2021-01-01 00:00:00| 7418.0|
# | USA|2/1/2021|brand1| cars1| 1940| 2000|2021-01-02 00:00:00| 9358.0|
# | USA|3/1/2021|brand1| cars1| 4692| 2900|2021-01-03 00:00:00| 14050.0|
# | USA|4/1/2021|brand1| cars1| null| null|2021-01-04 00:00:00| 14050.0|
# | USA|1/1/2021|brand2| cars2|16383104.2| 16666667|2021-01-01 00:00:00| 16383104.2|
# | USA|2/1/2021|brand2| cars2|26812874.2| 16666667|2021-01-02 00:00:00| 43195978.4|
# | USA|3/1/2021|brand2| cars2| null| null|2021-01-03 00:00:00| 43195978.4|
# | USA|1/1/2021|brand3| cars3| 75.6| 70.0|2021-01-01 00:00:00| 75.6|
# | USA|2/1/2021|brand3| cars3| 77.1| 70.0|2021-01-02 00:00:00| 76.4|
# | USA|3/1/2021|brand3| cars3| 73.1| 70.0|2021-01-03 00:00:00| 75.3|
# -------- -------- ------ ------ ---------- ------------ ------------------- -------------
I'm trying this code, but I get null for every row in the TotalSumValue column.
windowval=(Window.partitionBy('Location','Brand').orderBy('month_in_timestamp')
.rangeBetween(Window.unboundedPreceding, 0))
df = df.withColumn('TotalSumValue',
F.when(F.col('Brand').isin('brand1', 'brand2'), F.sum('TrueValue').over(windowval)),
F.when(F.col('Brand').isin('brand3'), F.avg('TrueValue').over(windowval)))
CodePudding user response:
You need to chain the when() clauses, because withColumn takes a single column expression and you want to populate one column:
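from pyspark.sql import Window, functions as F

# Running frame: from the start of each (Location, Brand) partition through the current row's month
windowval = (Window.partitionBy('Location', 'Brand').orderBy('month_in_timestamp')
             .rangeBetween(Window.unboundedPreceding, 0))
# One chained when(): running sum for brand1/brand2, running average for brand3
df = df.withColumn('TotalSumValue',
    F.when(F.col('Brand').isin('brand1', 'brand2'), F.sum('TrueValue').over(windowval))
     .when(F.col('Brand').isin('brand3'), F.avg('TrueValue').over(windowval)))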
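If you want to sanity-check the numbers, here is a rough plain-Python sketch (no Spark involved) of what the window is expected to compute per brand. It assumes TrueValue is already numeric (percent signs stripped) and reuses the month_in_timestamp values from the expected output in the question. The names `running_totals` and `avg_brands` are made up for this illustration and are not part of any library.

```python
from itertools import groupby

# Rows copied from the question: (Brand, month_in_timestamp, TrueValue).
# None marks a missing TrueValue. Values are assumed to be numeric already.
rows = [
    ("brand1", "2021-01-01", 7418.0),
    ("brand1", "2021-01-02", 1940.0),
    ("brand1", "2021-01-03", 4692.0),
    ("brand1", "2021-01-04", None),
    ("brand2", "2021-01-01", 16383104.2),
    ("brand2", "2021-01-02", 26812874.2),
    ("brand2", "2021-01-03", None),
    ("brand3", "2021-01-01", 75.6),
    ("brand3", "2021-01-03", 73.1),
    ("brand3", "2021-01-02", 77.1),
]


def running_totals(rows, avg_brands=("brand3",)):
    """Return (brand, month, total): a running sum, or a running mean for avg_brands."""
    out = []
    # Sorting by (brand, month) plays the role of partitionBy + orderBy.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for brand, group in groupby(ordered, key=lambda r: r[0]):
        total, count = 0.0, 0
        for _, month, value in group:
            if value is not None:  # missing values are skipped, like SQL sum/avg
                total += value
                count += 1
            if count == 0:
                result = None
            elif brand in avg_brands:
                result = total / count
            else:
                result = total
            out.append((brand, month, result))
    return out


totals = running_totals(rows)
for row in totals:
    print(row)
```

The brand3 averages come out unrounded here, so they match the 76.4 and 75.3 in the expected table only after rounding to one decimal.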