I want to aggregate on the Identifiant
column with count of different state
and represent all the state
.
Identifiant | state |
---|---|
ID01 | NY |
ID02 | NY |
ID01 | CA |
ID03 | CA |
ID01 | CA |
ID03 | NY |
ID01 | NY |
ID01 | CA |
ID01 | NY |
I'd like to obtain this dataset:
Identifiant | NY | CA |
---|---|---|
ID01 | 3 | 3 |
ID02 | 1 | 0 |
ID03 | 1 | 1 |
CodePudding user response:
Group by Identifiant
and pivot State
column:
from pyspark.sql import functions as F
result = (df.groupBy("Identifiant")
.pivot("State")
.count().na.fill(0)
)
result.show()
# ----------- --- ---
#|Identifiant| CA| NY|
# ----------- --- ---
#| ID03| 1| 1|
#| ID01| 3| 3|
#| ID02| 0| 1|
# ----------- --- ---