Home > Net >  How do I get multiple percentiles for multiple columns in PySpark
How do I get multiple percentiles for multiple columns in PySpark

Time:11-28

I'd like to get the percentiles of 10%, 20%, 30%... up to 90% for multiple columns in my DataFrame.

My DataFrame is set like this:

Col 1 Col 2 Col 3 Col 4 Col 5
250 200 100 50 125
50 10 50 10 10
10 20 20 10 50

(there are millions of rows)

My desired output would be something like this:

COL | PERCENTILE_10 | PERCENTILE_20 | PERCENTILE_30 | PERCENTILE_40 | PERCENTILE_50 | PERCENTILE_60 | PERCENTILE_70 | PERCENTILE_80 | PERCENTILE_90
col1| 25            |       35      |       45      |      55       |     69        | 100           |   150         |   190         |   250         
col2| 26            |       38      |       42      |      50       |     62        | 102           |   153         |   186         |   220     
col3| 27            |       42      |       46      |      54       |     60        | 104           |   156         |   194         |   260     
col4| 28            |       59      |       70      |      72       |     80        | 106           |   159         |   190         |   290     
col5| 40            |       60      |       89      |      93       |     100       | 108           |   162         |   200         |   300     

CodePudding user response:

You can loop through each column to calculate percentiles using percentile or percentile_approx functions, then union the resulting dfs :

from functools import reduce

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

percentiles_dfs = []

for c in df.columns:
    df1 = df.groupBy(F.lit(c).alias("COL")).agg(*[
        F.expr(f"int(percentile({c}, {i * 0.1}))").alias(f"PERCENTILE_{i}0")
        for i in range(1, 10)
    ])

    percentiles_dfs.append(df1)

result_df = reduce(DataFrame.union, percentiles_dfs)

result_df.show()

# ---- ------------- ------------- ------------- ------------- ------------- ------------- ------------- ------------- ------------- 
#| COL|PERCENTILE_10|PERCENTILE_20|PERCENTILE_30|PERCENTILE_40|PERCENTILE_50|PERCENTILE_60|PERCENTILE_70|PERCENTILE_80|PERCENTILE_90|
# ---- ------------- ------------- ------------- ------------- ------------- ------------- ------------- ------------- ------------- 
#|col1|           18|           26|           34|           42|           50|           89|          129|          170|          210|
#|col2|           12|           14|           16|           18|           20|           55|           91|          128|          164|
#|col3|           26|           32|           38|           44|           50|           60|           70|           80|           90|
#|col4|           10|           10|           10|           10|           10|           18|           25|           34|           42|
#|col5|           18|           26|           34|           42|           50|           65|           79|           95|          110|
# ---- ------------- ------------- ------------- ------------- ------------- ------------- ------------- ------------- ------------- 
  • Related