Pyspark - Apply groupBy aggregations on different group levels


I have the following pyspark dataframe:

+---------+-----+
|      Day|Sunny|
+---------+-----+
|   Sunday|  Yes|
|   Sunday|   No|
|   Monday|  Yes|
|   Monday|   No|
|  Tuesday|  Yes|
|  Tuesday|  Yes|
|  Tuesday|   No|
|  Tuesday|   No|
|  Tuesday|   No|
|Wednesday|  Yes|
|Wednesday|  Yes|
|Wednesday|  Yes|
|Wednesday|  Yes|
|Wednesday|   No|
| Thursday|  Yes|
| Thursday|  Yes|
| Thursday|   No|
| Thursday|   No|
|   Friday|   No|
|   Friday|   No|
|   Friday|  Yes|
| Saturday|  Yes|
| Saturday|  Yes|
| Saturday|   No|
+---------+-----+

I did a groupBy on the two columns Day and Sunny followed by a count, to get the number of rows for each ("Day", "Sunny") pair:

+---------+-----+-----+
|      Day|Sunny|count|
+---------+-----+-----+
|   Friday|   No|    2|
|   Friday|  Yes|    1|
|   Monday|  Yes|    1|
|   Monday|   No|    1|
| Saturday|   No|    1|
| Saturday|  Yes|    2|
|   Sunday|   No|    1|
|   Sunday|  Yes|    1|
| Thursday|  Yes|    2|
| Thursday|   No|    2|
|  Tuesday|   No|    3|
|  Tuesday|  Yes|    2|
|Wednesday|  Yes|    4|
|Wednesday|   No|    1|
+---------+-----+-----+

My question is: how do I get the following dataframe, i.e. add a column giving each Sunny count as a percentage of its Day group:

+---------+-----+-----+---------------------+
|      Day|Sunny|count|                     |
+---------+-----+-----+---------------------+
|   Friday|   No|    2| 66% of Friday       |
|   Friday|  Yes|    1| 33% of Friday       |
|   Monday|  Yes|    1| 50% of Monday       |
|   Monday|   No|    1| 50% of Monday       |
| Saturday|   No|    1| 33% of Saturday     |
| Saturday|  Yes|    2| 66% of Saturday     |
|   Sunday|   No|    1| 50% of Sunday       |
|   Sunday|  Yes|    1| 50% of Sunday       |
| Thursday|  Yes|    2| 50% of Thursday     |
| Thursday|   No|    2| 50% of Thursday     |
|  Tuesday|   No|    3| 60% of Tuesday      |
|  Tuesday|  Yes|    2| 40% of Tuesday      |
|Wednesday|  Yes|    4| 80% of Wednesday    |
|Wednesday|   No|    1| 20% of Wednesday    |
+---------+-----+-----+---------------------+

The code for creating the dataframe is:

df = spark.createDataFrame(
    [
        ("Sunday", "Yes"),
        ("Sunday", "No"),
        ("Monday", "Yes"),
        ("Monday", "No"),
        ("Tuesday", "Yes"),
        ("Tuesday", "Yes"),
        ("Tuesday", "No"),
        ("Tuesday", "No"),
        ("Tuesday", "No"),
        ("Wednesday", "Yes"),
        ("Wednesday", "Yes"),
        ("Wednesday", "Yes"),
        ("Wednesday", "Yes"),
        ("Wednesday", "No"),
        ("Thursday", "Yes"),
        ("Thursday", "Yes"),
        ("Thursday", "No"),
        ("Thursday", "No"),
        ("Friday", "No"),
        ("Friday", "No"),
        ("Friday", "Yes"),
        ("Saturday", "Yes"),
        ("Saturday", "Yes"),
        ("Saturday", "No"),
    ],
    ["Day", "Sunny"]

)


df.groupBy(["Day", "Sunny"]).count().sort("Day").show()

CodePudding user response:

You need the day's total (= count of (Friday, Yes) + count of (Friday, No)) to calculate the percentage, so you can compute that sum partitioned by Day with a window function.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# df here is the grouped dataframe that already has the 'count' column.
w = Window.partitionBy('Day')
df = df.withColumn('percentage', F.col('count') / F.sum(F.col('count')).over(w) * 100)

If you only want the integer part, use floor (or round) to drop the fractional part:

df = df.withColumn('percentage', F.floor(F.col('count') / F.sum(F.col('count')).over(w) * 100))
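If you also want the text column from the expected output (e.g. "66% of Friday"), a minimal sketch is to combine the floored percentage with the Day value using format_string; the column name percentage_label below is just an illustrative choice, not something from the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('Day')

# Start from the original dataframe and count each (Day, Sunny) pair.
counted = df.groupBy('Day', 'Sunny').count()

# Build a label such as "66% of Friday" from the floored percentage.
# 'percentage_label' is an arbitrary column name used for this example.
result = counted.withColumn(
    'percentage_label',
    F.format_string(
        '%d%% of %s',
        F.floor(F.col('count') / F.sum('count').over(w) * 100),
        F.col('Day'),
    ),
)

result.sort('Day', 'Sunny').show(truncate=False)

format_string follows printf-style formatting, so %% produces a literal percent sign and %d prints the floored percentage as an integer.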