This might be related to pivoting, but I am not sure. Basically, what I want to achieve is the following binary transformation:
-----------------
| C1 | C2 |
--------|--------
| A | xxx |
| B | yyy |
| A | yyy |
| B | www |
| B | xxx |
| A | zzz |
| A | xxx |
| A | yyy |
-----------------
to
--------------------------------------------
| C1 | www | xxx | yyy | zzz |
--------|--------|--------|--------|--------|
| A | 0 | 1 | 1 | 1 |
| B | 1 | 1 | 1 | 0 |
--------------------------------------------
How does one attain this in PySpark? Presence is 1 and absence is 0.
CodePudding user response:
Yes, you will need pivot
. But for aggregation, in your case it's best just to use F.first(F.lit(1))
and when you get nulls, just replace them with 0
using df.fillna(0)
.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('A', 'xxx'),
('B', 'yyy'),
('A', 'yyy'),
('B', 'www'),
('B', 'xxx'),
('A', 'zzz'),
('A', 'xxx'),
('A', 'yyy')],
['C1', 'C2'])
df = df.groupBy('C1').pivot('C2').agg(F.first(F.lit(1)))
df = df.fillna(0)
df.show()
# --- --- --- --- ---
# | C1|www|xxx|yyy|zzz|
# --- --- --- --- ---
# | B| 1| 1| 1| 0|
# | A| 0| 1| 1| 1|
# --- --- --- --- ---