How to fill up null values in Spark Dataframe based on other columns' value?

Time:11-16

Given this dataframe:

+-----+-----+----+
|num_a|num_b| sum|
+-----+-----+----+
|    1|    1|   2|
|   12|   15|  27|
|   56|   11|null|
|   79|    3|  82|
|  111|  114| 225|
+-----+-----+----+

How would you fill in the null values in the sum column when the value can be computed from the other columns? In this example, for the row with 56 and 11, the value would be 67.

I've tried df.fillna with a UDF, but that doesn't seem to work: the UDF received the column name rather than the actual value. I want to compute the value only for the rows with missing values, so creating a new column is not a viable option.

CodePudding user response:

You can use the coalesce function, which returns the first non-null value among its arguments. Check this sample code:

import pyspark.sql.functions as f

df = spark.createDataFrame(
   [(1, 2, 3),
   (12, 15, 27),
   (56, 11, None),
   (79, 3, 82)],
   ["num_a", "num_b", "sum"]
)

df.withColumn("sum", f.coalesce(f.col("sum"), f.col("num_a") + f.col("num_b"))).show()

Output is:

+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
|    1|    2|  3|
|   12|   15| 27|
|   56|   11| 67|
|   79|    3| 82|
+-----+-----+---+
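The fallback semantics of coalesce here can be sanity-checked in plain Python without a Spark session. The helper name fill_missing_sum below is hypothetical, not part of the Spark API; it just mirrors the "keep sum when present, otherwise use num_a + num_b" logic row by row:

```python
def fill_missing_sum(rows):
    # Each row is a (num_a, num_b, sum) tuple; keep sum when it is
    # not None, otherwise fall back to num_a + num_b -- the same
    # behavior as f.coalesce(f.col("sum"), f.col("num_a") + f.col("num_b")).
    return [
        (a, b, s if s is not None else a + b)
        for (a, b, s) in rows
    ]

rows = [(1, 2, 3), (12, 15, 27), (56, 11, None), (79, 3, 82)]
print(fill_missing_sum(rows))
# [(1, 2, 3), (12, 15, 27), (56, 11, 67), (79, 3, 82)]
```

Only the row with a missing sum is recomputed; existing values pass through untouched, matching the output table above.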

CodePudding user response:

If you specifically need a UDF, it can be done as:

import pyspark.sql.functions as F
from pyspark.sql.types import LongType

df = spark.createDataFrame(
   [(1, 2, 3),
   (12, 15, 27),
   (56, 11, None),
   (79, 3, 82)],
   ["num_a", "num_b", "sum"]
)

@F.udf(returnType=LongType())
def fill_with_sum(num_a, num_b, sum):
    return sum if sum is not None else (num_a + num_b)

df = df.withColumn("sum", fill_with_sum(F.col("num_a"), F.col("num_b"), F.col("sum")))

[Out]:
+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
|    1|    2|  3|
|   12|   15| 27|
|   56|   11| 67|
|   79|    3| 82|
+-----+-----+---+