PySpark : How to aggregate on a column with count of the different-CodePudding

I want to aggregate on the Identifiant column with count of different state and represent all the state.

Identifiant	state
ID01	NY
ID02	NY
ID01	CA
ID03	CA
ID01	CA
ID03	NY
ID01	NY
ID01	CA
ID01	NY

I'd like to obtain this dataset:

Identifiant	NY	CA
ID01	3	3
ID02	1	0
ID03	1	1

CodePudding user response：

Group by Identifiant and pivot State column:

from pyspark.sql import functions as F

result = (df.groupBy("Identifiant")
          .pivot("State")
          .count().na.fill(0)
          )

result.show()
# ----------- --- --- 
#|Identifiant| CA| NY|
# ----------- --- --- 
#|       ID03|  1|  1|
#|       ID01|  3|  3|
#|       ID02|  0|  1|
# ----------- --- ---