I have a PySpark dataframe with the following schema and data:
Key1 | Key2 | Key3 | Value |
---|---|---|---|
a | a | a | "value1" |
a | a | a | "value2" |
a | a | b | "value1" |
b | b | a | "value2" |
(In real life this dataframe is extremely large, so converting it to a pandas DataFrame is not an option.)
My goal is to transform the dataframe to look like so:
Key1 | Key2 | Key3 | value1 | value2 |
---|---|---|---|---|
a | a | a | 1 | 1 |
a | a | b | 1 | 0 |
b | b | a | 0 | 1 |
I know this is possible in pandas using the get_dummies function, and I have also seen that there is some sort of PySpark/pandas hybrid function, but I am not sure I can use it.
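For reference, this is roughly the pandas approach I have in mind (a sketch on a tiny, made-up pandas DataFrame; the real data would never fit in pandas):

import pandas as pd

pdf = pd.DataFrame({
    'Key1': ['a', 'a', 'a', 'b'],
    'Key2': ['a', 'a', 'a', 'b'],
    'Key3': ['a', 'a', 'b', 'a'],
    'Value': ['value1', 'value2', 'value1', 'value2'],
})

# one 0/1 column per distinct Value, then collapse duplicate key rows
dummies = pd.get_dummies(pdf, columns=['Value'], prefix='', prefix_sep='', dtype=int)
onehot = dummies.groupby(['Key1', 'Key2', 'Key3'], as_index=False).max()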
It is worth mentioning that the Value column can take (in this example) only the values "value1" and "value2".
I have encountered this question, which possibly solves my problem, but I do not entirely understand it and was wondering if there is a simpler way to do this.
Any help is greatly appreciated!
SMALL EDIT
After implementing the accepted solution, to turn the result into a one-hot encoding rather than a count of appearances, I cast each of the new columns to boolean type and then back to integer (roughly as in the sketch below).
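Concretely, something along these lines (a sketch only; `result` stands for the pivoted dataframe returned by the accepted answer, with the two value columns hard-coded):

from pyspark.sql import functions as F

# non-zero counts become true, then true/false become 1/0
result = result.select(
    'Key1', 'Key2', 'Key3',
    *[F.col(c).cast('boolean').cast('int').alias(c) for c in ['value1', 'value2']]
)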
CodePudding user response:
You can group by the key columns and pivot the value column, counting the records in each group.
from pyspark.sql import functions as func

# group by the key columns, pivot on the value column (one output column per
# distinct value), count the rows in each cell, and replace missing cells with 0
data_sdf. \
    groupBy('key1', 'key2', 'key3'). \
    pivot('val'). \
    agg(func.count('*')). \
    fillna(0). \
    show()
# +----+----+----+------+------+
# |key1|key2|key3|value1|value2|
# +----+----+----+------+------+
# |   b|   b|   a|     0|     1|
# |   a|   a|   a|     1|     1|
# |   a|   a|   b|     1|     0|
# +----+----+----+------+------+
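Since Value can only take "value1" and "value2", you can also pass that list to pivot explicitly; Spark then skips the extra job it would otherwise run to discover the distinct values (a sketch using the same data_sdf and func names as above):

data_sdf. \
    groupBy('key1', 'key2', 'key3'). \
    pivot('val', ['value1', 'value2']). \
    agg(func.count('*')). \
    fillna(0). \
    show()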
CodePudding user response:
This can also be achieved by grouping twice: first count how many times each full row occurs, then pivot on Value while summing those counts.
from pyspark.sql import functions as F

# count occurrences of each full row, then pivot those counts on the Value column
df = df.groupby(*df.columns).agg(F.count('*').alias('cnt')) \
    .groupby('Key1', 'Key2', 'Key3').pivot('Value').agg(F.sum('cnt')).fillna(0)
df.show(truncate=False)
# +----+----+----+------+------+
# |Key1|Key2|Key3|value1|value2|
# +----+----+----+------+------+
# |a   |a   |b   |1     |0     |
# |b   |b   |a   |0     |1     |
# |a   |a   |a   |1     |1     |
# +----+----+----+------+------+
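If you want the 0/1 one-hot form from the question's edit directly instead of counts, one option is to pivot a literal 1 with max (a sketch, not part of the original answer; src_df stands for the original, un-pivoted dataframe):

from pyspark.sql import functions as F

# any key combination with at least one matching row gets 1; missing combinations become 0
onehot = src_df.groupby('Key1', 'Key2', 'Key3') \
    .pivot('Value', ['value1', 'value2']) \
    .agg(F.max(F.lit(1))) \
    .fillna(0)
onehot.show(truncate=False)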