pyspark value of column when other column has first nonmissing value-CodePudding

Suppose I have the following pyspark dataframe df:

id     date   var1   var2
1         1   NULL      2
1         2      b      3
2         1      a   NULL
2         2      a      1

I want the first non missing observation for all var* columns and additionally the value of date where this is from, i.e. the final result should look like:

id     var1   dt_var1  var2    dt_var2
1         b         2     2          1  
2         a         1     1          2

Getting the values is straightforward using

df.orderBy(['id','date']).groupby('id').agg(
  *[F.first(x, ignorenulls=True).alias(x) for x in ['var1', 'var2']]
)

But I fail to see how I could get the respective dates. I could loop variable for variable, drop missing, and keep the first row. But this sounds like a poor solution that will not scale well, as it would require a separate dataframe for each variable.

I would prefer a solution that scales to many columns (var3, var4,...)

CodePudding user response：

You should not use groupby if you want to get the first non-null according to date ordering. The order is not guaranteed after a groupby operation even if you called orderby just before.

You need to use window functions instead. To get the date associated with each var value you can use this trick with structs:

from pyspark.sql import Window, functions as F

w = (Window.partitionBy("id").orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
     )

df1 = df.select(
    "id",
    *[F.first(
        F.when(F.col(x).isNotNull(), F.struct(x, F.col("date").alias(f"dt_{x}"))),
        ignorenulls=True).over(w).alias(x)
      for x in ["var1", "var2"]
      ]
).distinct().select("id", "var1.*", "var2.*")

df1.show()
# --- ---- ------- ---- ------- 
#| id|var1|dt_var1|var2|dt_var2|
# --- ---- ------- ---- ------- 
#|  1|   b|      2|   2|      1|
#|  2|   a|      1|   1|      2|
# --- ---- ------- ---- -------