I have a DataFrame like the one below in PySpark:
import pyspark.sql.functions as f
df = spark.createDataFrame(
    [(123, 2897402, 43.25, 2),
     (124, 2897402, 49.11, 1),
     (125, 2897402, 43.25, 2),
     (126, 2897402, 48.75, 0)],
    ['model_id', 'lab_test_id', 'summary_measure_value', 'reading_precision'])
Expected output:
+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|     123|    2897402|                43.25|                2|        43.25|
|     124|    2897402|                49.11|                1|         49.1|
|     125|    2897402|                43.25|                2|        43.25|
|     126|    2897402|                48.75|                0|         49.0|
+--------+-----------+---------------------+-----------------+-------------+
I have tried the following:
df1 = df.withColumn("reading_value", f.round(f.col("summary_measure_value"), f.col("reading_precision")))
but this fails with a "Column is not iterable" error.
How can I achieve what I want?
CodePudding user response:
The scale argument of f.round must be a literal int, not a Column, which is why the "Column is not iterable" error is raised. You may instead use a udf that applies Python's built-in round function, e.g.:
from pyspark.sql.types import DoubleType

@f.udf(returnType=DoubleType())  # without a returnType the udf would return strings
def udf_round(value, precision):
    try:
        precision = int(precision)
        value = float(value)
        # use Python's built-in round function to round values
        return round(value, precision)
    except (TypeError, ValueError):
        # decide what to return when you encounter bad data;
        # in this example the original value is returned
        return value
df = df.withColumn("reading_value", udf_round(f.col("summary_measure_value"), f.col("reading_precision")))
df.show(truncate=False)
Outputs:
+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|123     |2897402    |43.25                |2                |43.25        |
|124     |2897402    |49.11                |1                |49.1         |
|125     |2897402    |43.25                |2                |43.25        |
|126     |2897402    |48.75                |0                |49.0         |
+--------+-----------+---------------------+-----------------+-------------+
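If you'd rather avoid a udf entirely, a UDF-free sketch (assuming the values fit comfortably in a double) is to scale by a power of ten, round with the native f.round, and scale back:

# UDF-free sketch: scale up, round to zero decimals, scale back down
factor = f.pow(f.lit(10), f.col("reading_precision"))
df2 = df.withColumn(
    "reading_value",
    f.round(f.col("summary_measure_value") * factor) / factor
)

Note that Spark's round uses HALF_UP while Python's built-in round uses banker's rounding (half to even), so the two approaches can disagree on exact .5 ties, and the multiply/divide trick is subject to the usual floating-point caveats.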