PySpark: how to compute percentages by condition in a DataFrame


How can I compute the percentage of rows in each performance bucket: P <= 5; 5 < P <= 15; and P > 15?

address   performance (P)
NACELLES 589
NACELLES 0
NACELLES 48
NACELLES 318
NACELLES 378
NACELLES 52
NACELLES 45
NACELLES 201
NACELLES 416
NACELLES 29
NACELLES 183
NACELLES 53
NACELLES 7
NACELLES 127
NACELLES 157
NACELLES 248
NACELLES 10
NACELLES 317
NACELLES 2
NACELLES 4

The expected output is:

address     P<=5    P>5 & P<=15    P>15
NACELLES    15 %    10 %           75 %
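
For reproducibility, here is a minimal sketch that loads the sample data above into a PySpark DataFrame (the SparkSession bootstrapping is assumed; the column names address/performance are taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the 20 performance values listed in the question, all for address NACELLES
values = [589, 0, 48, 318, 378, 52, 45, 201, 416, 29,
          183, 53, 7, 127, 157, 248, 10, 317, 2, 4]
df = spark.createDataFrame(
    [("NACELLES", p) for p in values],
    ["address", "performance"],
)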

CodePudding user response:

Using your dataframe as an example:

+--------+-----------+
| address|performance|
+--------+-----------+
|NACELLES|        589|
|NACELLES|          0|
|NACELLES|         48|
|NACELLES|        318|
+--------+-----------+

You simply have to aggregate and sum using a when function:

from pyspark.sql import functions as F

df.groupBy("address").agg(
    # when(cond, 1) marks matching rows with 1 (null otherwise);
    # sum skips nulls, so sum/count gives the fraction of matching rows
    (F.sum(F.when(F.col("performance") <= 5, 1)) / F.count("*")).alias("P<=5"),
    (
        F.sum(F.when((F.col("performance") > 5) & (F.col("performance") <= 15), 1))
        / F.count("*")
    ).alias("P>5 & P<=15"),
    (F.sum(F.when(F.col("performance") > 15, 1)) / F.count("*")).alias("P>15"),
).show()

+--------+----+-----------+----+
| address|P<=5|P>5 & P<=15|P>15|
+--------+----+-----------+----+
|NACELLES|0.15|        0.1|0.75|
+--------+----+-----------+----+
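
The fractions above can also be rendered in the "15 %" style the question asks for. A possible variant (a sketch, not part of the original answer; the pct helper name is mine) averages a 0/1 flag instead of summing, which additionally returns 0 rather than null when a bucket is empty, then formats the result:

def pct(cond):
    # avg of a 0/1 flag equals the fraction of rows where cond holds;
    # otherwise(0) yields 0 instead of null for empty buckets
    return F.format_string("%.0f %%", F.avg(F.when(cond, 1).otherwise(0)) * 100)

df.groupBy("address").agg(
    pct(F.col("performance") <= 5).alias("P<=5"),
    pct((F.col("performance") > 5) & (F.col("performance") <= 15)).alias("P>5 & P<=15"),
    pct(F.col("performance") > 15).alias("P>15"),
).show()

On the sample data this would print 15 %, 10 % and 75 %, matching the expected output in the question.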