I want to select the median value of one of my dataset's columns (the median being the value located at the middle of the set of values ranked in ascending order, for example alphabetical order for strings). For now I basically do it like this:
List<Row> listRows = dataset.where(dataset.col(column).isNotNull()).orderBy(column)
.select(column).collectAsList();
int division = (int) Math.ceil(((double) listRows.size()) / 2);
String medianValue = listRows.size() % 2 == 0 ? listRows.get(division).getString(0)
        : listRows.get(division - 1).getString(0);
But I want to avoid using collectAsList() and the listRows variable, and instead get the value from a single Spark transformation. How can I achieve that in Spark?
Example:
+---------+---------+
| name    | address |
+---------+---------+
| Patrick | NYC     |
| Mel     | SF      |
| John    | TX      |
+---------+---------+
If the target column is "name", then I want the value Mel to be returned, since ordering the "name" column alphabetically gives John, then Mel, then Patrick, and the median of those three is Mel.
PS: I work in Java, but solutions in PySpark or Scala are also welcome.
Thanks.
CodePudding user response:
I think it would be acceptable to:
- Create a row number column (using a ROW_NUMBER window function; monotonically_increasing_id is not suitable here because its values are not consecutive)
- Cache the dataset (so you can count and find the median without having to re-evaluate the whole plan)
- Count the dataset
- Find the median position
- Filter the dataset where the row number equals that position
Or in code, something like this:
// assumes: import static org.apache.spark.sql.functions.*;
Dataset<Row> withId = dataset.withColumn("id", expr("ROW_NUMBER() OVER (ORDER BY name)"));
long count = withId.cache().count();
long median = (count + 1) / 2; // 1-based position of the middle row (lower middle for even counts)
withId.where(col("id").equalTo(median)); // returns the row holding the median value
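Note that ROW_NUMBER() with a global ORDER BY (no PARTITION BY) moves all rows to a single partition, so this is fine for moderate data sizes but will not scale to very large datasets; the cache() is there so the sorted plan is not re-evaluated for both the count and the filter.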
Another workaround (see the sketch after this list) can be:
- Create a row number column
- Find the maximum row number
- Create another column that calculates the median position from steps 1 and 2
- Filter where the row number equals the calculation in step 3
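A minimal sketch of those four steps in Java, assuming the same static imports as above; the column names "maxId" and "medianId" are illustrative, not part of any Spark API:
Dataset<Row> withId = dataset.withColumn("id", expr("ROW_NUMBER() OVER (ORDER BY name)")); // step 1
Dataset<Row> medianRow = withId
        .withColumn("maxId", expr("MAX(id) OVER ()"))      // step 2: maximum row number, windowed over the whole dataset
        .withColumn("medianId", expr("CEIL(maxId / 2.0)")) // step 3: middle position
        .where(col("id").equalTo(col("medianId")))         // step 4: keep only the median row
        .select("name");
For the example table above this returns Mel: the ids are 1 to 3, maxId is 3, and CEIL(3 / 2.0) = 2, which is Mel's row number. Everything stays in one Spark plan, with nothing collected to the driver until you call something like first() on medianRow.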