How to compute the number of performance such that performance = P<=5 ; P>5 & P<=15 ; P>15
address | performance = P |
---|---|
NACELLES | 589 |
NACELLES | 0 |
NACELLES | 48 |
NACELLES | 318 |
NACELLES | 378 |
NACELLES | 52 |
NACELLES | 45 |
NACELLES | 201 |
NACELLES | 416 |
NACELLES | 29 |
NACELLES | 183 |
NACELLES | 53 |
NACELLES | 7 |
NACELLES | 127 |
NACELLES | 157 |
NACELLES | 248 |
NACELLES | 10 |
NACELLES | 317 |
NACELLES | 2 |
NACELLES | 4 |
We obtain this dataset
address | P<=5 | P>5 & P<=15 | P> 15 |
---|---|---|---|
NACELLES | 15 % | 10 % | 75 % |
CodePudding user response:
using your dataframe as an example :
-------- -----------
| address|performance|
-------- -----------
|NACELLES| 589|
|NACELLES| 0|
|NACELLES| 48|
|NACELLES| 318|
You simply have to aggregate and sum using a when function :
df.groupBy("address").agg(
(F.sum(F.when(F.col("performance") <= 5, 1)) / F.count("*")).alias("P<=5"),
(
F.sum(F.when((F.col("performance") > 5) & (F.col("performance") <= 15), 1))
/ F.count("*")
).alias("P>5 & P<=15"),
(F.sum(F.when(F.col("performance") > 15, 1)) / F.count("*")).alias("P>15"),
).show()
-------- ---- ----------- ----
| address|P<=5|P>5 & P<=15|P>15|
-------- ---- ----------- ----
|NACELLES|0.15| 0.1|0.75|
-------- ---- ----------- ----