How to map one column to multiple binary columns in Spark?

This might be related to pivoting, but I am not sure. Basically, what I want to achieve is the following binary transformation:

+--------+--------+
| C1     | C2     |
+--------+--------+
| A      | xxx    |
| B      | yyy    |
| A      | yyy    |
| B      | www    |
| B      | xxx    |
| A      | zzz    |
| A      | xxx    |
| A      | yyy    |
+--------+--------+

to

+--------+--------+--------+--------+--------+
| C1     | www    | xxx    | yyy    | zzz    |
+--------+--------+--------+--------+--------+
| A      |   0    |   1    |   1    |   1    |
| B      |   1    |   1    |   1    |   0    |
+--------+--------+--------+--------+--------+

How can this be achieved in PySpark? Presence should be marked as 1 and absence as 0.

CodePudding user response:

Yes, you will need a pivot. For the aggregation, it is simplest to use F.first(F.lit(1)); the combinations that never occur come back as null, and you can replace those with 0 using df.fillna(0).

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('A', 'xxx'),
     ('B', 'yyy'),
     ('A', 'yyy'),
     ('B', 'www'),
     ('B', 'xxx'),
     ('A', 'zzz'),
     ('A', 'xxx'),
     ('A', 'yyy')],
    ['C1', 'C2'])

# Pivot C2 into columns; every (C1, C2) pair that exists gets the literal 1
df = df.groupBy('C1').pivot('C2').agg(F.first(F.lit(1)))
# Combinations that never occur come back as null - replace them with 0
df = df.fillna(0)

df.show()
# +---+---+---+---+---+
# | C1|www|xxx|yyy|zzz|
# +---+---+---+---+---+
# |  B|  1|  1|  1|  0|
# |  A|  0|  1|  1|  1|
# +---+---+---+---+---+
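
As a side note, if the set of C2 values is known in advance, you can pass it to pivot explicitly; this fixes the output column order and lets Spark skip the extra job it otherwise runs to collect the distinct pivot values. A minimal sketch, assuming df still holds the original two-column input (before it is overwritten by the pivot above) and that the value list known_values is complete:

# Assumption: the distinct C2 values are known up front
known_values = ['www', 'xxx', 'yyy', 'zzz']

result = (df.groupBy('C1')
            .pivot('C2', known_values)      # explicit values: no extra pass over the data
            .agg(F.first(F.lit(1)))
            .fillna(0))

result.show()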