I have two columns in my dataframe: "Cust_ID" and "Fill_days". I need to add an "Adjusted Days" column.
The idea: for the first row of every "Cust_ID", the "Adjusted Days" value should be 0. For each following row,
if the previous row's "Fill_days" + "Adjusted Days" is < 0, then 0; otherwise the previous row's "Fill_days" + "Adjusted Days".
If we needed to do this in Excel, the following formulas would be used:
C2=0
C3=IF(B2+C2<0,0,B2+C2)
C4=IF(B3+C3<0,0,B3+C3)
I was able to write the code using pandas; however, it is slow.
How can I do this in Spark? I use Spark 3.2.1.
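For reference, the pandas logic is essentially the following per-group Python loop (a minimal sketch of the idea, not my exact code), which is what makes it slow on large data:
import pandas as pd

pdf = pd.DataFrame({
    'Cust_ID':   [1, 1, 1, 1, 1, 1, 5, 5, 5],
    'Fill_days': [5, 2, 1, -9, -2, 9, -2, 1, -1],
})

def adjust(group):
    # first row per customer is 0; each next row is the previous row's
    # Fill_days + Adjusted_Days, floored at 0
    adjusted = [0]
    for prev_fill in group['Fill_days'].iloc[:-1]:
        adjusted.append(max(0, prev_fill + adjusted[-1]))
    return pd.Series(adjusted, index=group.index)

pdf['Adjusted_Days'] = pdf.groupby('Cust_ID', group_keys=False).apply(adjust)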
CodePudding user response:
First of all, you need a column for order. Spark does not preserve row order: rows can end up at any position unless you have a column that tells the exact order. I have added the column "order".
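If your data has no such column and you trust the current physical order (for example, right after reading a file), one possible stopgap is monotonically_increasing_id. This is only a sketch under that assumption; a real timestamp or sequence column is safer:
from pyspark.sql import functions as F

# The generated IDs increase with the current row order, but Spark gives
# no order guarantee after shuffles or joins.
df = df.withColumn('order', F.monotonically_increasing_id())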
Secondly, referencing the same column being computed is not possible per se, so you need a workaround. The following collects the data into a list for every "Cust_ID" and does the operations only within this list. When it's done, it explodes the result using inline.
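As a tiny illustration (my own toy example, assuming a SparkSession named spark): inline explodes an array of structs into rows, one column per struct field:
spark.sql(
    "SELECT inline(array(struct(1 AS a, 'x' AS b), struct(2 AS a, 'y' AS b)))"
).show()
# +---+---+
# |  a|  b|
# +---+---+
# |  1|  x|
# |  2|  y|
# +---+---+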
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 1, 5),
     (1, 2, 2),
     (1, 3, 1),
     (1, 4, -9),
     (1, 5, -2),
     (1, 6, 9),
     (5, 1, -2),
     (5, 2, 1),
     (5, 3, -1)],
    ['Cust_ID', 'order', 'Fill_days'])
Script:
df = df.groupBy('Cust_ID').agg(
    F.aggregate(
        # collect every customer's rows into a list, sorted by "order"
        F.array_sort(F.collect_list(F.struct('order', 'Fill_days'))),
        # seed element: a dummy struct with Fill_days = 0, Adjusted_Days = 0
        F.expr("array(struct(bigint(null) order, 0L Fill_days, 0L Adjusted_Days))"),
        # fold step: append the current row, computing Adjusted_Days from the
        # last element of the accumulator (array_union acts as an append here,
        # because the "order" field makes every struct unique)
        lambda acc, x: F.array_union(
            acc,
            F.array(x.withField(
                'Adjusted_Days',
                F.greatest(
                    F.lit(0),
                    F.element_at(acc, -1)['Fill_days'] + F.element_at(acc, -1)['Adjusted_Days']
                )
            ))
        )
    ).alias('a')
)
# drop the dummy seed element and explode the array of structs back into rows
df = df.selectExpr("Cust_ID", "inline(slice(a, 2, size(a)))")
df.show(truncate=False)
# +-------+-----+---------+-------------+
# |Cust_ID|order|Fill_days|Adjusted_Days|
# +-------+-----+---------+-------------+
# |1      |1    |5        |0            |
# |1      |2    |2        |5            |
# |1      |3    |1        |7            |
# |1      |4    |-9       |8            |
# |1      |5    |-2       |0            |
# |1      |6    |9        |0            |
# |5      |1    |-2       |0            |
# |5      |2    |1        |0            |
# |5      |3    |-1       |1            |
# +-------+-----+---------+-------------+
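If the F.aggregate fold is hard to follow, here is a toy running sum (a minimal sketch using the same spark session) showing the same left-to-right mechanics: the lambda is applied to every array element, threading an accumulator through:
toy = spark.createDataFrame([([1, 2, 3],)], ['xs'])
toy.select(F.aggregate('xs', F.lit(0), lambda acc, x: acc + x).alias('total')).show()
# +-----+
# |total|
# +-----+
# |    6|
# +-----+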
To understand it in depth, please analyze this answer, as it's not an easy task to explain it a second time.