Comparing two values in a structfield of a column in pyspark


I have a column where each row is a struct. I want to get the max of two values nested inside that struct.

I tried this

trends_df = trends_df.withColumn("importance_score", max(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"], key=max_key))

But it throws this error

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

For now I am getting it done with a UDF:

max_key = lambda x: x if x else float("-inf")
_get_max_udf = udf(lambda x, y: max(x,y, key=max_key), FloatType())
trends_df = trends_df.withColumn("importance_score", _get_max_udf(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"]))

This works, but I want to know if there is a way to avoid the UDF and get it done with built-in Spark functions only.
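As a side note, the key function used in the UDF can be exercised in plain Python, without Spark. The sketch below is only illustrative (the name `pick_max` is hypothetical). It shows how the key treats `None`, and also that it treats `0` the same way, because `0` is falsy.

```python
# Same key as in the UDF: any falsy value (None, but also 0 and 0.0) ranks lowest.
max_key = lambda x: x if x else float("-inf")


def pick_max(x, y):
    """Hypothetical helper mirroring the body of the UDF lambda."""
    return max(x, y, key=max_key)


print(pick_max(0.2, 0.7))    # both present: the larger one wins
print(pick_max(None, 0.7))   # None ranks as -inf, so 0.7 wins
print(pick_max(None, None))  # both rank as -inf, so the first argument (None) is returned
print(pick_max(0.0, -1.5))   # 0.0 is falsy and ranks as -inf, so -1.5 is returned
```

The last case may or may not be intended. It matters when comparing this UDF with a null-skipping built-in, which would treat `0.0` as an ordinary value.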

Edit: This is the result of trends_df.printSchema()

 |-- avg_total: struct (nullable = true)
 |    |-- max: struct (nullable = true)
 |    |    |-- avg_percent: double (nullable = true)
 |    |    |-- max_index: long (nullable = true)
 |    |    |-- max_val: long (nullable = true)
 |    |    |-- total_percent: double (nullable = true)
 |    |    |-- total_val: long (nullable = true)
 |    |-- min: struct (nullable = true)
 |    |    |-- avg_percent: double (nullable = true)
 |    |    |-- min_index: long (nullable = true)
 |    |    |-- min_val: long (nullable = true)
 |    |    |-- total_percent: double (nullable = true)
 |    |    |-- total_val: long (nullable = true)

Answer:

Adding an answer from the comments to highlight it.

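As suggested by @smurphy, I used the greatest function from pyspark.sql.functions. Python's built-in max does not work on Column objects: it has to evaluate them as Python booleans, which raises the ValueError above. greatest builds a Spark expression instead and skips null values.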

trends_df = trends_df.withColumn("importance_score", greatest(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"]))

