Given this dataframe:
+-----+-----+----+
|num_a|num_b| sum|
+-----+-----+----+
|    1|    1|   2|
|   12|   15|  27|
|   56|   11|null|
|   79|    3|  82|
|  111|  114| 225|
+-----+-----+----+
How would you fill in null values in the sum column when the value can be computed from the other columns? In this example, the row with 56 and 11 should get the value 67.
I've tried df.fillna with a UDF, but that doesn't seem to work: the UDF was receiving the column name rather than the actual value. I want to compute the value only for the rows with missing values, so creating a new column would not be a viable option.
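For reference, this is roughly the kind of thing I attempted (a sketch, not working code). As far as I know, df.fillna only accepts literal values (int, float, string, bool, or a dict of those), not column expressions, so a computed per-row sum can't be passed to it:

import pyspark.sql.functions as f

# This fails: fillna expects a literal value per column,
# not a Column expression, so the computed sum is rejected.
df.fillna({"sum": f.col("num_a") + f.col("num_b")})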
CodePudding user response:
You can use the coalesce function, which returns the first non-null value among its arguments. Check this sample code:
import pyspark.sql.functions as f
df = spark.createDataFrame(
    [(1, 2, 3),
     (12, 15, 27),
     (56, 11, None),
     (79, 3, 82)],
    ["num_a", "num_b", "sum"]
)
df.withColumn("sum", f.coalesce(f.col("sum"), f.col("num_a") f.col("num_b"))).show()
Output is:
+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
|    1|    2|  3|
|   12|   15| 27|
|   56|   11| 67|
|   79|    3| 82|
+-----+-----+---+
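coalesce returns its first non-null argument, so existing sums are kept and only the null rows fall through to num_a + num_b. If you prefer to make the condition explicit, an equivalent when/otherwise version (a sketch against the same df) would be:

from pyspark.sql import functions as f

# Keep sum where it is present; otherwise compute it from the other columns.
df.withColumn(
    "sum",
    f.when(f.col("sum").isNull(), f.col("num_a") + f.col("num_b"))
     .otherwise(f.col("sum"))
).show()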
CodePudding user response:
If your requirement is to use a UDF, it can be done as follows:
import pyspark.sql.functions as F
from pyspark.sql.types import LongType
df = spark.createDataFrame(
    [(1, 2, 3),
     (12, 15, 27),
     (56, 11, None),
     (79, 3, 82)],
    ["num_a", "num_b", "sum"]
)
@F.udf(returnType=LongType())
def fill_with_sum(num_a, num_b, sum):
    # Keep the existing value; compute num_a + num_b only when sum is null.
    return sum if sum is not None else (num_a + num_b)

df = df.withColumn("sum", fill_with_sum(F.col("num_a"), F.col("num_b"), F.col("sum")))
df.show()
[Out]:
+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
|    1|    2|  3|
|   12|   15| 27|
|   56|   11| 67|
|   79|    3| 82|
+-----+-----+---+
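Note that a native expression like coalesce is usually preferable to a Python UDF: UDFs move each row between the JVM and a Python worker and are opaque to the query optimizer, so the first answer should be faster on large data.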