I'd like to get the percentiles of 10%
, 20%
, 30%
... up to 90%
for multiple columns in my DataFrame.
My DataFrame is set like this:
Col 1 | Col 2 | Col 3 | Col 4 | Col 5 |
---|---|---|---|---|
250 | 200 | 100 | 50 | 125 |
50 | 10 | 50 | 10 | 10 |
10 | 20 | 20 | 10 | 50 |
(there are millions of rows)
My desired output would be something like this:
COL | PERCENTILE_10 | PERCENTILE_20 | PERCENTILE_30 | PERCENTILE_40 | PERCENTILE_50 | PERCENTILE_60 | PERCENTILE_70 | PERCENTILE_80 | PERCENTILE_90
col1| 25 | 35 | 45 | 55 | 69 | 100 | 150 | 190 | 250
col2| 26 | 38 | 42 | 50 | 62 | 102 | 153 | 186 | 220
col3| 27 | 42 | 46 | 54 | 60 | 104 | 156 | 194 | 260
col4| 28 | 59 | 70 | 72 | 80 | 106 | 159 | 190 | 290
col5| 40 | 60 | 89 | 93 | 100 | 108 | 162 | 200 | 300
CodePudding user response:
You can loop through each column to calculate percentiles using percentile
or percentile_approx
functions, then union the resulting dfs :
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
percentiles_dfs = []
for c in df.columns:
df1 = df.groupBy(F.lit(c).alias("COL")).agg(*[
F.expr(f"int(percentile({c}, {i * 0.1}))").alias(f"PERCENTILE_{i}0")
for i in range(1, 10)
])
percentiles_dfs.append(df1)
result_df = reduce(DataFrame.union, percentiles_dfs)
result_df.show()
# ---- ------------- ------------- ------------- ------------- ------------- ------------- ------------- ------------- -------------
#| COL|PERCENTILE_10|PERCENTILE_20|PERCENTILE_30|PERCENTILE_40|PERCENTILE_50|PERCENTILE_60|PERCENTILE_70|PERCENTILE_80|PERCENTILE_90|
# ---- ------------- ------------- ------------- ------------- ------------- ------------- ------------- ------------- -------------
#|col1| 18| 26| 34| 42| 50| 89| 129| 170| 210|
#|col2| 12| 14| 16| 18| 20| 55| 91| 128| 164|
#|col3| 26| 32| 38| 44| 50| 60| 70| 80| 90|
#|col4| 10| 10| 10| 10| 10| 18| 25| 34| 42|
#|col5| 18| 26| 34| 42| 50| 65| 79| 95| 110|
# ---- ------------- ------------- ------------- ------------- ------------- ------------- ------------- ------------- -------------