I have a PySpark dataframe with the following schema and data:
Key1 | Key2 | Key3 | Value |
---|---|---|---|
a | a | a | "value1" |
a | a | a | "value2" |
a | a | b | "value1" |
b | b | a | "value2" |
(In real life this dataframe is extremely large, so converting it to a pandas DataFrame is not an option.)
My goal is to transform the dataframe to look like so:
Key1 | Key2 | Key3 | value1 | value2 |
---|---|---|---|---|
a | a | a | 1 | 1 |
a | a | b | 1 | 0 |
b | b | a | 0 | 1 |
I know this is possible in pandas using the get_dummies function, and I have also seen that there is some sort of PySpark/pandas hybrid function, but I am not sure I can use it.
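For reference, this is roughly the pandas approach I have in mind (a sketch on a tiny, made-up pandas DataFrame; the real data would never fit in pandas):

import pandas as pd

pdf = pd.DataFrame({
    'Key1': ['a', 'a', 'a', 'b'],
    'Key2': ['a', 'a', 'a', 'b'],
    'Key3': ['a', 'a', 'b', 'a'],
    'Value': ['value1', 'value2', 'value1', 'value2'],
})

# one 0/1 column per distinct Value, then collapse duplicate key rows
dummies = pd.get_dummies(pdf, columns=['Value'], prefix='', prefix_sep='', dtype=int)
onehot = dummies.groupby(['Key1', 'Key2', 'Key3'], as_index=False).max()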
It is worth mentioning that the Value column can take (in this example) only the values "value1" and "value2".
I have encountered this question, which possibly solves my problem, but I do not entirely understand it and was wondering if there is a simpler way to do this.
Any help is greatly appreciated!
SMALL EDIT
After implementing the accepted solution, to turn the result into a one-hot encoding rather than a count of appearances, I cast each of the new columns to boolean type and then back to integer (roughly as in the sketch below).
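Concretely, something along these lines (a sketch only; `result` stands for the pivoted dataframe returned by the accepted answer, with the two value columns hard-coded):

from pyspark.sql import functions as F

# non-zero counts become true, then true/false become 1/0
result = result.select(
    'Key1', 'Key2', 'Key3',
    *[F.col(c).cast('boolean').cast('int').alias(c) for c in ['value1', 'value2']]
)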
CodePudding user response:
You can group by the key columns and pivot the value column, counting the records in each group.
from pyspark.sql import functions as func

# group by the key columns, pivot on the value column (one output column per
# distinct value), count the rows in each cell, and replace missing cells with 0
data_sdf. \
    groupBy('key1', 'key2', 'key3'). \
    pivot('val'). \
    agg(func.count('*')). \
    fillna(0). \
    show()
# +----+----+----+------+------+
# |key1|key2|key3|value1|value2|
# +----+----+----+------+------+
# |   b|   b|   a|     0|     1|
# |   a|   a|   a|     1|     1|
# |   a|   a|   b|     1|     0|
# +----+----+----+------+------+
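Since Value can only take "value1" and "value2", you can also pass that list to pivot explicitly; Spark then skips the extra job it would otherwise run to discover the distinct values (a sketch using the same data_sdf and func names as above):

data_sdf. \
    groupBy('key1', 'key2', 'key3'). \
    pivot('val', ['value1', 'value2']). \
    agg(func.count('*')). \
    fillna(0). \
    show()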
CodePudding user response:
This can also be achieved by grouping twice: first count how many times each full row occurs, then pivot on Value while summing those counts.
from pyspark.sql import functions as F

# count occurrences of each full row, then pivot those counts on the Value column
df = df.groupby(*df.columns).agg(F.count('*').alias('cnt')) \
    .groupby('Key1', 'Key2', 'Key3').pivot('Value').agg(F.sum('cnt')).fillna(0)
df.show(truncate=False)
# +----+----+----+------+------+
# |Key1|Key2|Key3|value1|value2|
# +----+----+----+------+------+
# |a   |a   |b   |1     |0     |
# |b   |b   |a   |0     |1     |
# |a   |a   |a   |1     |1     |
# +----+----+----+------+------+
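If you want the 0/1 one-hot form from the question's edit directly instead of counts, one option is to pivot a literal 1 with max (a sketch, not part of the original answer; src_df stands for the original, un-pivoted dataframe):

from pyspark.sql import functions as F

# any key combination with at least one matching row gets 1; missing combinations become 0
onehot = src_df.groupby('Key1', 'Key2', 'Key3') \
    .pivot('Value', ['value1', 'value2']) \
    .agg(F.max(F.lit(1))) \
    .fillna(0)
onehot.show(truncate=False)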