I have a column where each row is a struct, and I want to get the max of two values nested inside that struct.
I tried this:
trends_df = trends_df.withColumn("importance_score", max(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"], key=max_key))
But it throws this error
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
This happens because Python's built-in max evaluates a Column in a boolean context, which Spark Columns do not support. I am now getting it done with a UDF:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType
max_key = lambda x: x if x is not None else float("-inf")  # test None explicitly, so a legitimate 0.0 is not treated as missing
_get_max_udf = udf(lambda x, y: max(x, y, key=max_key), FloatType())
trends_df = trends_df.withColumn("importance_score", _get_max_udf(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"]))
This works, but I want to know if there is a way to avoid the UDF and do it with Spark's built-in functions alone.
Edit:
This is the result of trends_df.printSchema()
root
|-- avg_total: struct (nullable = true)
| |-- max: struct (nullable = true)
| | |-- avg_percent: double (nullable = true)
| | |-- max_index: long (nullable = true)
| | |-- max_val: long (nullable = true)
| | |-- total_percent: double (nullable = true)
| | |-- total_val: long (nullable = true)
| |-- min: struct (nullable = true)
| | |-- avg_percent: double (nullable = true)
| | |-- min_index: long (nullable = true)
| | |-- min_val: long (nullable = true)
| | |-- total_percent: double (nullable = true)
| | |-- total_val: long (nullable = true)
CodePudding user response:
Adding an answer from the comments to highlight it.
As answered by @smurphy, I used the greatest function:
trends_df = trends_df.withColumn("importance_score", greatest(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"]))
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.greatest
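Per the Spark docs, greatest skips null arguments and returns null only if every argument is null, so it reproduces the UDF's float("-inf") fallback for missing values. Below is a minimal self-contained sketch on made-up data; the nested layout and the agg_importance field are assumptions mirroring the question (agg_importance does not appear in the posted schema):

# Minimal runnable sketch with hypothetical data; field names mirror the question.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, greatest, struct

spark = SparkSession.builder.getOrCreate()

# Build avg_total as a struct of structs, like the schema in the question.
df = spark.createDataFrame(
    [(1.2, 0.4), (None, 0.9), (0.3, None)],
    ["max_imp", "min_imp"],
).select(
    struct(
        struct(col("max_imp").alias("agg_importance")).alias("max"),
        struct(col("min_imp").alias("agg_importance")).alias("min"),
    ).alias("avg_total")
)

# greatest() evaluates row-wise and skips nulls: yields 1.2, 0.9, 0.3
df.withColumn(
    "importance_score",
    greatest(col("avg_total")["max"]["agg_importance"],
             col("avg_total")["min"]["agg_importance"]),
).show(truncate=False)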