This is probably very simple, but for some reason I can't figure it out. I have a df that looks like this. I need to create a column where I subtract the 2&3avg col from the median col when Q = 1. I assume you need to use a window function and then create a UDF to subtract the columns? Or am I overcomplicating this?
+----------+------+---+------+------+
|      date|median|  Q|2&3avg|result|
+----------+------+---+------+------+
|2018-03-31|     6|  1|    15|    -9|
|2018-03-31|    27|  2|    15|    -9|
|2018-03-31|     3|  3|    15|    -9|
|2018-03-31|    44|  4|    15|    -9|
|2018-06-30|     6|  1|    18|   -12|
|2018-06-30|     4|  3|    18|   -12|
|2018-06-30|    32|  2|    18|   -12|
|2018-06-30|   112|  4|    18|   -12|
|2018-09-30|     2|  1|    20|   -18|
|2018-09-30|    23|  4|    20|   -18|
|2018-09-30|    37|  3|    20|   -18|
|2018-09-30|     3|  2|    20|   -18|
+----------+------+---+------+------+
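For reference, a minimal sketch that rebuilds this sample frame (I renamed 2&3avg to 2_3avg, since & is awkward to reference in Spark SQL):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question; '2_3avg' stands in for '2&3avg'
df = spark.createDataFrame(
    [("2018-03-31", 6, 1, 15), ("2018-03-31", 27, 2, 15),
     ("2018-03-31", 3, 3, 15), ("2018-03-31", 44, 4, 15),
     ("2018-06-30", 6, 1, 18), ("2018-06-30", 4, 3, 18),
     ("2018-06-30", 32, 2, 18), ("2018-06-30", 112, 4, 18),
     ("2018-09-30", 2, 1, 20), ("2018-09-30", 23, 4, 20),
     ("2018-09-30", 37, 3, 20), ("2018-09-30", 3, 2, 20)],
    ["date", "median", "Q", "2_3avg"],
)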
CodePudding user response:
You can indeed calculate the median using a window function. To avoid a UDF, you can use expr with percentile_approx to get the percentile. Then a simple when restricts the calculation to the rows where Q = 1:
from pyspark.sql import Window as W
import pyspark.sql.functions as F

# median of 2and3avg within each Q partition; backticks are needed
# because the column name starts with a digit
w = W.partitionBy('Q')
percentile = F.expr('percentile_approx(`2and3avg`, 0.5)')

df\
    .withColumn('med_val', percentile.over(w))\
    .withColumn('new_col', F.when(F.col('Q') == 1, F.col('med_val') - F.col('median'))
                            .otherwise(F.lit('N/A')))\
    .orderBy('date', 'Q')\
    .show()
+----------+------+---+--------+------+-------+-------+-------+
|      date|median|  Q|2and3avg|result|2result|med_val|new_col|
+----------+------+---+--------+------+-------+-------+-------+
|2018-03-31|     6|  1|      15|    -9|     -9|     18|     12|
|2018-03-31|    27|  2|      15|    -9|     -9|     18|    N/A|
|2018-03-31|     3|  3|      15|    -9|     -9|     18|    N/A|
|2018-03-31|    44|  4|      15|    -9|     -9|     18|    N/A|
|2018-06-30|     6|  1|      18|   -12|    -12|     18|     12|
|2018-06-30|    32|  2|      18|   -12|    -12|     18|    N/A|
|2018-06-30|     4|  3|      18|   -12|    -12|     18|    N/A|
|2018-06-30|   112|  4|      18|   -12|    -12|     18|    N/A|
|2018-09-30|     2|  1|      20|   -18|    -18|     18|     16|
|2018-09-30|     3|  2|      20|   -18|    -18|     18|    N/A|
|2018-09-30|    37|  3|      20|   -18|    -18|     18|    N/A|
|2018-09-30|    23|  4|      20|   -18|    -18|     18|    N/A|
+----------+------+---+--------+------+-------+-------+-------+
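Note that F.lit('N/A') forces new_col to string type. If you'd rather keep the column numeric, a sketch of the same when with a real null instead (same w and percentile as above):

# .otherwise(F.lit(None)) yields a real null and keeps new_col numeric
df\
    .withColumn('med_val', percentile.over(w))\
    .withColumn('new_col', F.when(F.col('Q') == 1, F.col('med_val') - F.col('median'))
                            .otherwise(F.lit(None)))\
    .show()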
CodePudding user response:
You can use PySpark's when/otherwise functions.
Since you only have to subtract where Q == 1, the result for every other row will be null:
>>> from pyspark.sql.functions import when, col
>>> df.withColumn("result", when(col('Q') == 1, df['median'] - df['2_3avg']).otherwise(None)).show()
+----------+------+---+------+------+
|      date|median|  Q|2_3avg|result|
+----------+------+---+------+------+
|2018-03-31|     6|  1|    15|    -9|
|2018-03-31|    27|  2|    15|  null|
|2018-03-31|     3|  3|    15|  null|
|2018-03-31|    44|  4|    15|  null|
|2018-06-30|     6|  1|    18|   -12|
|2018-06-30|     4|  3|    18|  null|
|2018-06-30|    32|  2|    18|  null|
|2018-06-30|   112|  4|    18|  null|
|2018-09-30|     2|  1|    20|   -18|
|2018-09-30|    23|  4|    20|  null|
|2018-09-30|    37|  3|    20|  null|
|2018-09-30|     3|  2|    20|  null|
+----------+------+---+------+------+
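The sample result column in the question repeats the Q = 1 difference across every row of the same date. If that's what you're after, a sketch with a window partitioned by date (again assuming the 2_3avg column name):

from pyspark.sql import Window
import pyspark.sql.functions as F

# the when() yields the difference only on the Q == 1 row and null elsewhere;
# F.max ignores those nulls, so the single value is broadcast over the date
w_date = Window.partitionBy('date')
df.withColumn(
    'result',
    F.max(F.when(F.col('Q') == 1, F.col('median') - F.col('2_3avg'))).over(w_date)
).show()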
CodePudding user response:
I think this answers your question:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.functions import when, col
>>> # a string placeholder instead of a real null: collect_list silently drops nulls
>>> df1 = df.withColumn("result", when(col('Q') == 1, df['median'] - df['2_3avg']).otherwise("null"))
>>> df1.groupby("date").agg(F.collect_list("median"), F.collect_list("Q"), F.collect_list("2_3avg"), F.collect_list("result")).show()
+----------+--------------------+---------------+--------------------+--------------------+
|      date|collect_list(median)|collect_list(Q)|collect_list(2_3avg)|collect_list(result)|
+----------+--------------------+---------------+--------------------+--------------------+
|2018-06-30|     [6, 4, 32, 112]|   [1, 3, 2, 4]|    [18, 18, 18, 18]|[-12, null, null,...|
|2018-03-31|      [6, 27, 3, 44]|   [1, 2, 3, 4]|    [15, 15, 15, 15]|[-9, null, null, ...|
|2018-09-30|      [2, 23, 37, 3]|   [1, 4, 3, 2]|    [20, 20, 20, 20]|[-18, null, null,...|
+----------+--------------------+---------------+--------------------+--------------------+
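Two caveats with collect_list: it drops real nulls (which is why the string placeholder above survives), and it makes no ordering guarantee. A sketch that keeps the lists aligned by collecting (Q, result) structs and sorting:

# structs compare field by field, so sort_array orders each list by Q
df1.groupby("date").agg(
    F.sort_array(F.collect_list(F.struct("Q", "result"))).alias("by_Q")
).show(truncate=False)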