I'm understanding Spark pretty well conceptually but I'm still navigating the specific syntax options for expressions. I'm using Spark 3.3.0 with Java 17.
I have a table with a column foo
that shouldn't have values much higher than 30. I want to add a column with the proportion of foo
with 30
, that is, the value of foo
normalized to 0.0 <= proportion <= 1.0
with a cutoff of 30
for foo
. Finally foo
is a decimal type, represented in Java by BigDecimal
.
I might make this calculation like this in Java:
BigDecimal maxVal = new BigDecimal(30);
BigDecimal proportion = foo.max(maxVal).divide(maxVal);
(Or I could first divide by maxVal
and take the max of the result and 1.0
, although the first approach above is probably more efficient and more direct semantically.)
How would I best do this in Spark? The following seems to work and gets me most of the way there:
BigDecimal maxVal = new BigDecimal(30);
df = df.withColumn("proportion", col("foo").divide(maxVal));
But what about the max()
part so that I can cut off the value at 30
? The only "max" function I could find offhand was the Spark SQL MAX()
aggregate function.
CodePudding user response:
While min and max are aggregation functions that work on a group of rows, the functions least and greatest can compare values within a row or with a constant value.
import static org.apache.spark.sql.functions.*;
Dataset<Row> df = ...
BigDecimal maxVal=new BigDecimal(30);
df.withColumn("proportion", greatest(col("value"), lit(maxVal)).divide(maxVal)).show();
Output:
----- ----------
|value|proportion|
----- ----------
| 1| 1.000000|
| 10| 1.000000|
| 20| 1.000000|
| 30| 1.000000|
| 31| 1.033333|
| 40| 1.333333|
----- ----------
If the values should be capped at maxValue
one would use least
instead of greatest
(this would correspond to using BigDecimal.min in the non-Spark code):
df.withColumn("proportion", least(col("value"), lit(maxVal)).divide(maxVal)).show();
Output now:
----- ----------
|value|proportion|
----- ----------
| 1| 0.033333|
| 10| 0.333333|
| 20| 0.666667|
| 30| 1.000000|
| 31| 1.000000|
| 40| 1.000000|
----- ----------