How to map one column to multiple binary columns in Spark?-CodePudding

This might be related to pivoting, but I am not sure. Basically, what I want to achieve is the following binary transformation:

 ----------------- 
| C1     | C2     |
 --------|-------- 
| A      | xxx    |
| B      | yyy    |
| A      | yyy    |
| B      | www    |
| B      | xxx    |
| A      | zzz    |
| A      | xxx    |
| A      | yyy    |
 -----------------

 -------------------------------------------- 
| C1     | www    | xxx    | yyy    | zzz    |
 --------|--------|--------|--------|--------|
| A      |   0    |   1    |   1    |   1    |
| B      |   1    |   1    |   1    |   0    |
 --------------------------------------------

How does one attain this in PySpark? Presence is 1 and absence is 0.

CodePudding user response：

Yes, you will need pivot. But for aggregation, in your case it's best just to use F.first(F.lit(1)) and when you get nulls, just replace them with 0 using df.fillna(0).

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('A', 'xxx'),
     ('B', 'yyy'),
     ('A', 'yyy'),
     ('B', 'www'),
     ('B', 'xxx'),
     ('A', 'zzz'),
     ('A', 'xxx'),
     ('A', 'yyy')],
    ['C1', 'C2'])

df = df.groupBy('C1').pivot('C2').agg(F.first(F.lit(1)))
df = df.fillna(0)

df.show()
#  --- --- --- --- --- 
# | C1|www|xxx|yyy|zzz|
#  --- --- --- --- --- 
# |  B|  1|  1|  1|  0|
# |  A|  0|  1|  1|  1|
#  --- --- --- --- ---