I have a DataFrame with columns Title and bin:
+--------------------+----+
|               Title| bin|
+--------------------+----+
| Forrest Gump (1994)|   3|
| Pulp Fiction (1994)|   2|
|  Matrix, The (1999)|   3|
|    Toy Story (1995)|   1|
|  Fight Club (1999)|   3|
+--------------------+----+
How do I count the occurrences of each bin value, with each count in its own column of a new DataFrame, using PySpark? For instance:
+-----------+-----------+-----------+
|count(bin1)|count(bin2)|count(bin3)|
+-----------+-----------+-----------+
|          1|          1|          3|
+-----------+-----------+-----------+
Is this possible? Could someone show me how?
CodePudding user response:
Group by bin and count, then pivot the bin column, and rename the columns of the resulting DataFrame if you want:
import pyspark.sql.functions as F

# Count rows per bin, then pivot so each distinct bin value becomes a column
df1 = df.groupBy("bin").count().groupBy().pivot("bin").agg(F.first("count"))
# Rename the pivoted columns from "1", "2", "3" to "count_bin1", ...
df1 = df1.toDF(*[f"count_bin{c}" for c in df1.columns])
df1.show()
# +----------+----------+----------+
# |count_bin1|count_bin2|count_bin3|
# +----------+----------+----------+
# |         1|         1|         3|
# +----------+----------+----------+