I have the following pyspark dataframe:
--------- -----
| Day|Sunny|
--------- -----
| Sunday| Yes|
| Sunday| No|
| Monday| Yes|
| Monday| No|
| Tuesday| Yes|
| Tuesday| Yes|
| Tuesday| No|
| Tuesday| No|
| Tuesday| No|
|Wednesday| Yes|
|Wednesday| Yes|
|Wednesday| Yes|
|Wednesday| Yes|
|Wednesday| No|
| Thursday| Yes|
| Thursday| Yes|
| Thursday| No|
| Thursday| No|
| Friday| No|
| Friday| No|
| Friday| Yes|
| Saturday| Yes|
| Saturday| Yes|
| Saturday| No|
--------- -----
I did a groupBy by the two columns Day
and Sunny
and then count, to get the count of each ("Day", "Sunny") pair:
--------- ----- -----
| Day|Sunny|count|
--------- ----- -----
| Friday| No| 2|
| Friday| Yes| 1|
| Monday| Yes| 1|
| Monday| No| 1|
| Saturday| No| 1|
| Saturday| Yes| 2|
| Sunday| No| 1|
| Sunday| Yes| 1|
| Thursday| Yes| 2|
| Thursday| No| 2|
| Tuesday| No| 3|
| Tuesday| Yes| 2|
|Wednesday| Yes| 4|
|Wednesday| No| 1|
--------- ----- -----
My question is how do I get the following dataframe, that is adding a column of the percentage of each Sunny
value count in each Day
group:
--------- ----- ----- ---------------------
| Day|Sunny|count| |
--------- ----- ----- ---------------------
| Friday| No| 2| 66% of Friday |
| Friday| Yes| 1| 33% of Friday |
| Monday| Yes| 1| 50% of Monday |
| Monday| No| 1| 50% of Monday |
| Saturday| No| 1| 33% of Saturday |
| Saturday| Yes| 2| 66% of Saturday |
| Sunday| No| 1| 50% of Sunday |
| Sunday| Yes| 1| 50% of Sunday |
| Thursday| Yes| 2| 50% of Thursday |
| Thursday| No| 2| 50% of Thursday |
| Tuesday| No| 3| 60% of Tuesday |
| Tuesday| Yes| 2| 40% of Tuesday |
|Wednesday| Yes| 4| 80% of Wednesday |
|Wednesday| No| 1| 20% of Wednesday |
--------- ----- ----- ---------------------
The code for creating the dataframe is:
df = spark.createDataFrame(
[
("Sunday", "Yes"),
("Sunday", "No"),
("Monday", "Yes"),
("Monday", "No"),
("Tuesday", "Yes"),
("Tuesday", "Yes"),
("Tuesday", "No"),
("Tuesday", "No"),
("Tuesday", "No"),
("Wednesday", "Yes"),
("Wednesday", "Yes"),
("Wednesday", "Yes"),
("Wednesday", "Yes"),
("Wednesday", "No"),
("Thursday", "Yes"),
("Thursday", "Yes"),
("Thursday", "No"),
("Thursday", "No"),
("Friday", "No"),
("Friday", "No"),
("Friday", "Yes"),
("Saturday", "Yes"),
("Saturday", "Yes"),
("Saturday", "No"),
],
["Day", "Sunny"]
)
df.groupBy(["Day", "Sunny"]).count().sort("Day").show()
CodePudding user response:
You need the total number of days (= "count of (Fri, Yes)" "count of (Fri, No)"
) to calculate the percentage, so you can get sum
partitioned by Day
.
w = Window.partitionBy('Day')
df = df.withColumn('percentage', F.col('count') / F.sum(F.col('count')).over(w) * 100)
If you only want the integer, use floor/round to take out floating point.
df = df.withColumn('percentage', F.floor(F.col('count') / F.sum(F.col('count')).over(w) * 100))