Home > front end >  How to return the median of a column in a dataset?
How to return the median of a column in a dataset?

Time:07-17

I want to select the median value of one of a dataset column (the median being the value located at the middle of a set of values ranked in an ascending order (for example, alphabetical order for strings)) basically I do it for now like this:

List<Row> listRows = dataset.where(dataset.col(column).isNotNull()).orderBy(column)
                .select(column).collectAsList();
int division = (int) Math.ceil(((double) listRows.size()) / 2);

String medianValue = division % 2 == 0 ? listRows.get(division).getString(0)
                    : listRows.get(division - 1).getString(0);

But i want to avoid using collectAsList() method and listRows var and rather get the value from one single Spark transformation, how to achieve that in Spark?

Example:

 --------- --------- 
| name    | address |
 --------- --------- 
| Patrick | NYC     |
 --------- --------- 
| Mel     | SF      |
 --------- --------- 
| John    | TX      |
 --------- --------- 

if the target column is "name" then I want the value Mel to be returned since the ordering of "name" column will be like (John then Mel then Patrick (alphabetical order)) then the median is Mel. PS: i work in Java, but a solution with PySpark or Scala are also welcome. Thanks.

CodePudding user response:

I think it would be acceptable to:

  1. Create a row number column (using monotonically increasing id probably)
  2. Cache the dataset (so you can count and find the median without having to re-evaluate the whole plan)
  3. Count the dataset
  4. Find the median
  5. Filter the dataset where row number equals to your found median

Or in code, something like this:

listRows = listRows.withColumn("id", expr("ROW_NUMBER() OVER (ORDER BY name)"));

val count = listRows.cache().count();
val median = count / 2; // and some more operations if you are interested, like rounding

listRows.where(col("id") === median) // returns the row with median

Another workaround can be:

  1. Create a row number column
  2. Find the maximum row number column
  3. Create another column that calculates your median from step 1 and 2
  4. Filter where row number equals the calculation in step 3
  • Related