subtract column values over a window function


This is probably very simple, but for some reason I can't figure it out. I have a df that looks like the one below. I need to create a column where I subtract the 2&3avg column from the median column when Q = 1. I assume you need to use a window function and then create a UDF to subtract the columns? Or am I overcomplicating this?

+----------+--------+---+------------+--------+
|      date| median | Q | 2&3avg     | result |
+----------+--------+---+------------+--------+
|2018-03-31|       6|  1|          15|      -9|
|2018-03-31|      27|  2|          15|      -9|
|2018-03-31|       3|  3|          15|      -9|
|2018-03-31|      44|  4|          15|      -9|
|2018-06-30|       6|  1|          18|     -12|
|2018-06-30|       4|  3|          18|     -12|
|2018-06-30|      32|  2|          18|     -12|
|2018-06-30|     112|  4|          18|     -12|
|2018-09-30|       2|  1|          20|     -18|
|2018-09-30|      23|  4|          20|     -18|
|2018-09-30|      37|  3|          20|     -18|
|2018-09-30|       3|  2|          20|     -18|
+----------+--------+---+------------+--------+
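
For reference, a minimal sketch that reproduces this sample df (the column names, including 2&3avg, are taken from the table above; the result column appears to be the desired output, so it is left out):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data copied from the table above
df = spark.createDataFrame(
    [
        ("2018-03-31", 6, 1, 15), ("2018-03-31", 27, 2, 15),
        ("2018-03-31", 3, 3, 15), ("2018-03-31", 44, 4, 15),
        ("2018-06-30", 6, 1, 18), ("2018-06-30", 4, 3, 18),
        ("2018-06-30", 32, 2, 18), ("2018-06-30", 112, 4, 18),
        ("2018-09-30", 2, 1, 20), ("2018-09-30", 23, 4, 20),
        ("2018-09-30", 37, 3, 20), ("2018-09-30", 3, 2, 20),
    ],
    ["date", "median", "Q", "2&3avg"],
)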

CodePudding user response:

You can indeed calculate the median using a window function. To avoid a UDF, you can use expr to get the percentile. Then you can use a simple when to do the calculation only where Q = 1:

from pyspark.sql import Window as W
import pyspark.sql.functions as F

w = W.partitionBy('Q')
# column names that start with a digit need backticks inside expr()
percentile = F.expr('percentile_approx(`2and3avg`, 0.5)')

df\
    .withColumn('med_val', percentile.over(w))\
    .withColumn('new_col', F.when(F.col('Q') == 1, F.col('med_val') - F.col('median')
                                 ).otherwise(F.lit('N/A')))\
    .orderBy('date', 'Q')\
    .show()

+----------+------+---+--------+------+-------+-------+-------+
|      date|median|  Q|2and3avg|result|2result|med_val|new_col|
+----------+------+---+--------+------+-------+-------+-------+
|2018-03-31|     6|  1|      15|    -9|     -9|     18|     12|
|2018-03-31|    27|  2|      15|    -9|     -9|     18|    N/A|
|2018-03-31|     3|  3|      15|    -9|     -9|     18|    N/A|
|2018-03-31|    44|  4|      15|    -9|     -9|     18|    N/A|
|2018-06-30|     6|  1|      18|   -12|    -12|     18|     12|
|2018-06-30|    32|  2|      18|   -12|    -12|     18|    N/A|
|2018-06-30|     4|  3|      18|   -12|    -12|     18|    N/A|
|2018-06-30|   112|  4|      18|   -12|    -12|     18|    N/A|
|2018-09-30|     2|  1|      20|   -18|    -18|     18|     16|
|2018-09-30|     3|  2|      20|   -18|    -18|     18|    N/A|
|2018-09-30|    37|  3|      20|   -18|    -18|     18|    N/A|
|2018-09-30|    23|  4|      20|   -18|    -18|     18|    N/A|
+----------+------+---+--------+------+-------+-------+-------+
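
On Spark 3.1 or later, the same window median can also be written with the built-in percentile_approx column function instead of expr. A sketch under that version assumption, using the 2and3avg column name from this answer:

import pyspark.sql.functions as F
from pyspark.sql import Window as W

w = W.partitionBy('Q')

df\
    .withColumn('med_val', F.percentile_approx('2and3avg', 0.5).over(w))\
    .withColumn('new_col', F.when(F.col('Q') == 1,
                                  F.col('med_val') - F.col('median')))\
    .orderBy('date', 'Q')\
    .show()

Leaving out otherwise here keeps new_col numeric (non-matching rows become null) instead of casting the column to string with 'N/A'.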

CodePudding user response:

You can use the when/otherwise functions of PySpark.

Since you only have to subtract where Q == 1, the result for any other value will be null:

>>> from pyspark.sql.functions import when, col
>>> df.withColumn("result",when(col('Q')==1,(df['median']-df['2_3avg'])).otherwise("nulll")).show()
+----------+------+---+------+------+
|      date|median|  Q|2_3avg|result|
+----------+------+---+------+------+
|2018-03-31|     6|  1|    15|    -9|
|2018-03-31|    27|  2|    15| nulll|
|2018-03-31|     3|  3|    15| nulll|
|2018-03-31|    44|  4|    15| nulll|
|2018-06-30|     6|  1|    18|   -12|
|2018-06-30|     4|  3|    18| nulll|
|2018-06-30|    32|  2|    18| nulll|
|2018-06-30|   112|  4|    18| nulll|
|2018-09-30|     2|  1|    20|   -18|
|2018-09-30|    23|  4|    20| nulll|
|2018-09-30|    37|  3|    20| nulll|
|2018-09-30|     3|  2|    20| nulll|
+----------+------+---+------+------+
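
Note that otherwise("nulll") fills the non-matching rows with a literal string, which casts the whole result column to string type. If an actual null is wanted while keeping the column numeric, when can simply be left without an otherwise clause, since unmatched rows then default to null. A small sketch, assuming the 2_3avg column name used above:

from pyspark.sql.functions import when, col

df.withColumn(
    "result",
    # when() without otherwise() leaves non-matching rows as null,
    # so the column stays numeric instead of being cast to string
    when(col("Q") == 1, col("median") - col("2_3avg"))
).show()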

CodePudding user response:

I think this answers your question:

>>> import pyspark.sql.functions as F
>>> from pyspark.sql.functions import when, col
>>> df1=df.withColumn("result",when(col('Q')==1,(df['median']-df['2_3avg'])).otherwise("nulll"))
>>> df1.groupby("date").agg(F.collect_list("median"),F.collect_list("Q"),F.collect_list("2_3avg"),F.collect_list("result")).show()
+----------+--------------------+---------------+--------------------+--------------------+
|      date|collect_list(median)|collect_list(Q)|collect_list(2_3avg)|collect_list(result)|
+----------+--------------------+---------------+--------------------+--------------------+
|2018-06-30|     [6, 4, 32, 112]|   [1, 3, 2, 4]|    [18, 18, 18, 18]|[-12, nulll, null...|
|2018-03-31|      [6, 27, 3, 44]|   [1, 2, 3, 4]|    [15, 15, 15, 15]|[-9, nulll, nulll...|
|2018-09-30|      [2, 23, 37, 3]|   [1, 4, 3, 2]|    [20, 20, 20, 20]|[-18, nulll, null...|
+----------+--------------------+---------------+--------------------+--------------------+
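
If the goal is the result column shown in the question (the Q = 1 difference repeated on every row of the same date), one option is to compute it for the Q = 1 rows and spread it across a window partitioned by date. A sketch, assuming the 2_3avg column name used above:

import pyspark.sql.functions as F
from pyspark.sql import Window as W

w = W.partitionBy("date")

df.withColumn(
    "result",
    # max() over the date partition picks up the single non-null value
    # computed for the Q = 1 row and copies it to every row of that date
    F.max(
        F.when(F.col("Q") == 1, F.col("median") - F.col("2_3avg"))
    ).over(w),
).orderBy("date", "Q").show()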