I have a DataFrame like the example below:
productNo | productMT | productPR | productList |
---|---|---|---|
2389 | ['xy-5', 'yz-12', 'zb-56', 'iu-30'] | ['pr-1', 'pr-2', 'pr-3', 'pr-4'] | ['67230', '7839', '1339', '9793'] |
6745 | ['xy-4', 'yz-34', 'zb-8', 'iu-9'] | ['pr-6', 'pr-1', 'pr-3', 'pr-7'] | ['1111', '0987', '8910', '0348'] |
I would like to apply elif-style logic with multiple conditions: first check productMT, and if it passes the condition, then check productPR and take the position at which it satisfies the condition. For example: if productMT contains 'xy-5' and productPR contains 'pr-1', take the position of 'pr-1' and add a new column holding the value at that position in productList.
productNo | productMT | productPR | productList | output |
---|---|---|---|---|
2389 | ['xy-5', 'yz-12', 'zb-56', 'iu-30'] | ['pr-1', 'pr-2', 'pr-3', 'pr-4'] | ['67230', '7839', '1339', '9793'] | 67230 |
I tried using a filter, but it only handles a single condition, and I need it to loop over all rows with multiple conditions.
```
filtered = F.filter(
    F.arrays_zip('productList', 'productMT', 'productPR'),
    lambda x: (x.productMT == 'xy-5') & (x.productPR != 'pr-1')
)
df_array_pos = (df_array
    .withColumn('output', filtered[0].productList)
    .withColumn('flag', filtered[0].productMT))
```
CodePudding user response:
You just need to chain multiple `when` functions, one for each elif condition you want.

Your sample data:
```
df = spark.createDataFrame([
    (2389, ['xy-5', 'yz-12', 'zb-56', 'iu-30'], ['pr-1', 'pr-2', 'pr-3', 'pr-4'], ['67230', '7839', '1339', '9793']),
    (6745, ['xy-4', 'yz-34', 'zb-8', 'iu-9'], ['pr-6', 'pr-1', 'pr-3', 'pr-7'], ['1111', '0987', '8910', '0348']),
], ['productNo', 'productMT', 'productPR', 'productList'])
```
```
+---------+---------------------------+------------------------+-------------------------+
|productNo|productMT                  |productPR               |productList              |
+---------+---------------------------+------------------------+-------------------------+
|2389     |[xy-5, yz-12, zb-56, iu-30]|[pr-1, pr-2, pr-3, pr-4]|[67230, 7839, 1339, 9793]|
|6745     |[xy-4, yz-34, zb-8, iu-9]  |[pr-6, pr-1, pr-3, pr-7]|[1111, 0987, 8910, 0348] |
+---------+---------------------------+------------------------+-------------------------+
```
You can add as many `when` clauses as you like. Note that `array_position` is 1-based while array indexing is 0-based, hence the `- 1`:
```
from pyspark.sql import functions as F

(df
    .withColumn('output', F
        .when(
            F.array_contains('productMT', 'xy-5') & F.array_contains('productPR', 'pr-1'),
            F.col('productList')[F.array_position('productPR', 'pr-1') - 1]
        )
    )
    .show(10, False)
)
+---------+---------------------------+------------------------+-------------------------+------+
|productNo|productMT                  |productPR               |productList              |output|
+---------+---------------------------+------------------------+-------------------------+------+
|2389     |[xy-5, yz-12, zb-56, iu-30]|[pr-1, pr-2, pr-3, pr-4]|[67230, 7839, 1339, 9793]|67230 |
|6745     |[xy-4, yz-34, zb-8, iu-9]  |[pr-6, pr-1, pr-3, pr-7]|[1111, 0987, 8910, 0348] |null  |
+---------+---------------------------+------------------------+-------------------------+------+
```