Spark Java use math operations to get value proportion with max cutoff

Time:09-07

I'm understanding Spark pretty well conceptually but I'm still navigating the specific syntax options for expressions. I'm using Spark 3.3.0 with Java 17.

I have a table with a column foo that shouldn't have values much higher than 30. I want to add a column with the proportion of foo relative to 30, that is, the value of foo normalized to 0.0 <= proportion <= 1.0, with foo cut off at 30. Finally, foo is a decimal type, represented in Java by BigDecimal.

I might make this calculation like this in Java:

BigDecimal maxVal = new BigDecimal(30);
// divide() needs an explicit scale and rounding mode here, since
// quotients such as 31/30 are non-terminating decimals and would
// otherwise throw ArithmeticException.
BigDecimal proportion = foo.max(maxVal).divide(maxVal, 6, RoundingMode.HALF_UP);

(Or I could first divide by maxVal and take the max of the result and 1.0, although the first approach above is probably more efficient and more direct semantically.)

How would I best do this in Spark? The following seems to work and gets me most of the way there:

BigDecimal maxVal = new BigDecimal(30);
df = df.withColumn("proportion", col("foo").divide(maxVal));

But what about the max() part so that I can cut off the value at 30? The only "max" function I could find offhand was the Spark SQL MAX() aggregate function.

CodePudding user response:

While min and max are aggregate functions that operate on a group of rows, the functions least and greatest compare values within a single row, or against a constant value.

import static org.apache.spark.sql.functions.*;

Dataset<Row> df = ...
BigDecimal maxVal = new BigDecimal(30);
df.withColumn("proportion", greatest(col("value"), lit(maxVal)).divide(maxVal)).show();

Output:

+-----+----------+
|value|proportion|
+-----+----------+
|    1|  1.000000|
|   10|  1.000000|
|   20|  1.000000|
|   30|  1.000000|
|   31|  1.033333|
|   40|  1.333333|
+-----+----------+

If the values should instead be capped at maxVal, one would use least instead of greatest (this corresponds to using BigDecimal.min rather than max in the non-Spark code):

df.withColumn("proportion", least(col("value"), lit(maxVal)).divide(maxVal)).show();

Output now:

+-----+----------+
|value|proportion|
+-----+----------+
|    1|  0.033333|
|   10|  0.333333|
|   20|  0.666667|
|   30|  1.000000|
|   31|  1.000000|
|   40|  1.000000|
+-----+----------+
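For completeness, the least-based capping maps back to the plain-Java BigDecimal version from the question like this. This is a small sketch; the class name, the scale of 6, and HALF_UP rounding are assumptions chosen to match the 6-decimal output above:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class Proportion {
    // Cap foo at maxVal, then normalize to [0, 1].
    // min (not max) implements the cutoff, mirroring Spark's least().
    static BigDecimal proportion(BigDecimal foo, BigDecimal maxVal) {
        // An explicit scale and rounding mode are required because
        // quotients like 1/30 are non-terminating decimals.
        return foo.min(maxVal).divide(maxVal, 6, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(proportion(new BigDecimal(10), new BigDecimal(30))); // 0.333333
        System.out.println(proportion(new BigDecimal(40), new BigDecimal(30))); // 1.000000
    }
}
```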