PySpark: How to aggregate on a column with counts of the different states


I want to group by the Identifiant column, count the occurrences of each distinct state, and end up with one column per state.

Identifiant  state
ID01         NY
ID02         NY
ID01         CA
ID03         CA
ID01         CA
ID03         NY
ID01         NY
ID01         CA
ID01         NY

I'd like to obtain this dataset:

Identifiant  NY  CA
ID01          3   3
ID02          1   0
ID03          1   1

CodePudding user response:
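For reproducibility, here is a minimal sketch that builds the sample data first (this setup is an assumption, not part of the original post; it presumes a local SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the question
df = spark.createDataFrame(
    [("ID01", "NY"), ("ID02", "NY"), ("ID01", "CA"), ("ID03", "CA"),
     ("ID01", "CA"), ("ID03", "NY"), ("ID01", "NY"), ("ID01", "CA"),
     ("ID01", "NY")],
    ["Identifiant", "state"],
)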

Group by Identifiant and pivot on the state column:

result = (df.groupBy("Identifiant")   # one output row per Identifiant
          .pivot("state")             # one output column per distinct state
          .count()                    # count rows per (Identifiant, state) pair
          .na.fill(0)                 # missing combinations become 0 instead of null
          )

result.show()
# +-----------+---+---+
# |Identifiant| CA| NY|
# +-----------+---+---+
# |       ID03|  1|  1|
# |       ID01|  3|  3|
# |       ID02|  0|  1|
# +-----------+---+---+
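If the set of states is known up front, you can also pass the values to pivot explicitly. This avoids the extra job Spark otherwise runs to discover the distinct pivot values, and it fixes the column order (the NY/CA ordering below is a choice, matching the question's expected output):

result = (df.groupBy("Identifiant")
          .pivot("state", ["NY", "CA"])  # explicit pivot values: no distinct scan, fixed column order
          .count()
          .na.fill(0)
          )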