How to create multiple count columns in PySpark?


I have a dataframe of title and bin:

+---------------------+-------------+
|                Title|          bin|
+---------------------+-------------+
|  Forrest Gump (1994)|            3|
|  Pulp Fiction (1994)|            2|
|   Matrix, The (1999)|            3|
|     Toy Story (1995)|            1|
|    Fight Club (1999)|            3|
+---------------------+-------------+

How do I count the rows in each bin and put each count into its own column of a new dataframe using PySpark? For instance:

+------------+------------+------------+
| count(bin1)| count(bin2)| count(bin3)|
+------------+------------+------------+
|           1|           1|           3|
+------------+------------+------------+

Is this possible? Any help would be appreciated.

CodePudding user response:

Group by bin and count, then pivot on the bin column and, if you want, rename the columns of the resulting dataframe:

import pyspark.sql.functions as F

# Count rows per bin, then pivot so each bin value becomes its own column
df1 = df.groupBy("bin").count().groupBy().pivot("bin").agg(F.first("count"))

# Rename the pivoted columns from "1", "2", "3" to "count_bin1", ...
df1 = df1.toDF(*[f"count_bin{c}" for c in df1.columns])

df1.show()

#+----------+----------+----------+
#|count_bin1|count_bin2|count_bin3|
#+----------+----------+----------+
#|         1|         1|         3|
#+----------+----------+----------+
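
An alternative, if you'd rather skip the pivot, is to build one conditional count per bin value with F.count(F.when(...)). The sketch below is illustrative only: the sample dataframe is recreated so the snippet runs on its own, and it assumes the distinct bin values are few enough to collect to the driver.

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the question (only for a self-contained example)
df = spark.createDataFrame(
    [("Forrest Gump (1994)", 3), ("Pulp Fiction (1994)", 2),
     ("Matrix, The (1999)", 3), ("Toy Story (1995)", 1),
     ("Fight Club (1999)", 3)],
    ["Title", "bin"],
)

# Collect the distinct bin values (assumes there are only a handful)
bins = sorted(row["bin"] for row in df.select("bin").distinct().collect())

# One aggregation per bin: F.when() yields null for non-matching rows,
# and F.count() skips nulls, so each column counts a single bin
df2 = df.agg(*[
    F.count(F.when(F.col("bin") == b, 1)).alias(f"count_bin{b}")
    for b in bins
])
df2.show()

Both approaches need the distinct bin values at some point (pivot collects them internally when they are not supplied); the conditional-count version just makes that step explicit and keeps the column names deterministic.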