I have a DataFrame where I want to take the first possible non-null value in column "A", convert it with chr(), and add the result to a new column "C", grouped by column "Date". (For reference, chr(97) = 'a'.)
The result should look something like this:
+----------+----+---+---+
|Date      |A   |B  |C  |
+----------+----+---+---+
|21.02.2022|null|12 |a  |
|21.02.2022|null|23 |a  |
|21.02.2022|97  |25 |a  |
|22.02.2022|98  |28 |b  |
|22.02.2022|120 |29 |b  |
|22.02.2022|121 |19 |b  |
|22.02.2022|null|10 |b  |
+----------+----+---+---+
CodePudding user response:
I assume that by "first possible value" you mean the lowest non-null value per Date. Otherwise your problem is not well defined, because Spark does not preserve the ordering of the rows in a DataFrame unless sort() or orderBy() is called.
If that assumption is correct, you could solve it like this:
from pyspark.sql import Window, functions as F

# The frame must span the whole partition: nulls sort first in ascending
# order, so with the default frame those rows would still get null in C.
w = (Window.partitionBy('Date').orderBy('A')
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

result = (
    df
    # first non-null A in ascending order = lowest non-null A per Date
    .withColumn('C', F.first('A', ignorenulls=True).over(w))
    .withColumn('C', F.expr('chr(C)'))  # 97 -> 'a'
)
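On the sample data this yields C = 'a' for 21.02.2022 (the lowest non-null A is 97) and C = 'b' for 22.02.2022 (98). F.expr('chr(C)') is used because chr is a Spark SQL function; newer PySpark versions also expose a direct wrapper in pyspark.sql.functions, but the expr form works on older versions as well.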
CodePudding user response:
Alternatively, you can use the collect_list window function, partitioned by Date, to get an array of the non-null A values, as shown in the sketch below.
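A minimal sketch of that approach, again assuming "first possible value" means the lowest non-null value of A per Date (collect_list skips nulls; array_min requires Spark 2.4+):

from pyspark.sql import Window, functions as F

w = Window.partitionBy('Date')

result = (
    df
    # collect every non-null A value in the Date partition
    .withColumn('vals', F.collect_list('A').over(w))
    # smallest collected value, mapped to its character (97 -> 'a')
    .withColumn('C', F.expr('chr(array_min(vals))'))
    .drop('vals')
)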