How to create multiple count columns in PySpark?


I have a dataframe of title and bin:

+---------------------+-------------+
|                Title|          bin|
+---------------------+-------------+
|  Forrest Gump (1994)|            3|
|  Pulp Fiction (1994)|            2|
|   Matrix, The (1999)|            3|
|     Toy Story (1995)|            1|
|    Fight Club (1999)|            3|
+---------------------+-------------+

How do I count the rows in each bin and put each count into its own column of a new dataframe using PySpark? For instance:

+------------+------------+------------+
| count(bin1)| count(bin2)| count(bin3)|
+------------+------------+------------+
|           1|           1|           3|
+------------+------------+------------+

Is this possible? Any help would be appreciated.

CodePudding user response:

Group by bin and count, then pivot on the bin column and, if you want, rename the columns of the resulting dataframe:

import pyspark.sql.functions as F

# Count rows per bin, then pivot so each bin value becomes its own column
df1 = df.groupBy("bin").count().groupBy().pivot("bin").agg(F.first("count"))

# Rename the pivoted columns from "1", "2", "3" to "count_bin1", ...
df1 = df1.toDF(*[f"count_bin{c}" for c in df1.columns])

df1.show()

#+----------+----------+----------+
#|count_bin1|count_bin2|count_bin3|
#+----------+----------+----------+
#|         1|         1|         3|
#+----------+----------+----------+
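
An alternative, if you'd rather skip the pivot, is to build one conditional count per bin value with F.count(F.when(...)). The sketch below is illustrative only: the sample dataframe is recreated so the snippet runs on its own, and it assumes the distinct bin values are few enough to collect to the driver.

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the question (only for a self-contained example)
df = spark.createDataFrame(
    [("Forrest Gump (1994)", 3), ("Pulp Fiction (1994)", 2),
     ("Matrix, The (1999)", 3), ("Toy Story (1995)", 1),
     ("Fight Club (1999)", 3)],
    ["Title", "bin"],
)

# Collect the distinct bin values (assumes there are only a handful)
bins = sorted(row["bin"] for row in df.select("bin").distinct().collect())

# One aggregation per bin: F.when() yields null for non-matching rows,
# and F.count() skips nulls, so each column counts a single bin
df2 = df.agg(*[
    F.count(F.when(F.col("bin") == b, 1)).alias(f"count_bin{b}")
    for b in bins
])
df2.show()

Both approaches need the distinct bin values at some point (pivot collects them internally when they are not supplied); the conditional-count version just makes that step explicit and keeps the column names deterministic.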