I have a PySpark DataFrame that I want to pivot.
input_dataframe:
mdn | top_protocol_by_vol | top_vol | rank |
---|---|---|---|
55555 | AAA | 30 | 1 |
55555 | BBB | 20 | 2 |
55555 | DDD | 10 | 3 |
9898 | JJJ | 30 | 1 |
9898 | CCC | 20 | 2 |
9898 | FFF | 10 | 3 |
2030 | PPP | 30 | 1 |
2030 | KKK | 20 | 2 |
2030 | FFF | 10 | 3 |
and I want to get something like this:
output_dataframe:
mdn | top_protocol_by_vol_1 | top_protocol_by_vol_2 | top_protocol_by_vol_3 | top_vol_1 | top_vol_2 | top_vol_3 |
---|---|---|---|---|---|---|
2030 | PPP | KKK | FFF | 30 | 20 | 10 |
9898 | JJJ | CCC | FFF | 30 | 20 | 10 |
55555 | AAA | BBB | DDD | 30 | 20 | 10 |
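For reproducibility, here is a minimal sketch that recreates the input as a Spark DataFrame (assuming an active `spark` session; `input_dataframe.toPandas()` gives the Pandas frame used in the snippet below):

```python
input_dataframe = spark.createDataFrame(
    [
        (55555, 'AAA', 30, 1), (55555, 'BBB', 20, 2), (55555, 'DDD', 10, 3),
        (9898,  'JJJ', 30, 1), (9898,  'CCC', 20, 2), (9898,  'FFF', 10, 3),
        (2030,  'PPP', 30, 1), (2030,  'KKK', 20, 2), (2030,  'FFF', 10, 3),
    ],
    ['mdn', 'top_protocol_by_vol', 'top_vol', 'rank'],
)
```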
I know for sure that I can do something like this with Pandas using the code:
```python
output_dataframe = input_dataframe.pivot_table(
    index='mdn',
    columns=['rank'],
    aggfunc='first',  # each (mdn, rank) group holds exactly one row, so 'first' is a safe aggregate
    dropna=True,
).reset_index()

# Flatten the MultiIndex columns: ('top_protocol_by_vol', 1) -> 'top_protocol_by_vol_1'
output_dataframe.columns = ['_'.join(str(c) for c in col if c != '') for col in output_dataframe.columns.values]
```
How can I achieve the same result with PySpark, without converting to Pandas?
CodePudding user response:
You can use the `pivot` function with `first` as the aggregate.
```python
from pyspark.sql import functions as F

# Pivot on 'rank'; each (mdn, rank) group has exactly one row, so first() picks that value
df = (df.groupBy('mdn')
        .pivot('rank')
        .agg(F.first('top_protocol_by_vol').alias('top_protocol_by_vol'),
             F.first('top_vol').alias('top_vol')))
```
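Note that with two aliased aggregates, Spark names the pivoted columns `<rank>_<alias>`, e.g. `1_top_protocol_by_vol`. A minimal rename pass over those default names produces the exact `top_protocol_by_vol_1` layout from the question:

```python
# Rename '1_top_protocol_by_vol' -> 'top_protocol_by_vol_1', and so on
for c in df.columns:
    if c != 'mdn':
        rank, name = c.split('_', 1)
        df = df.withColumnRenamed(c, f'{name}_{rank}')
```

If the ranks are known up front, `pivot('rank', [1, 2, 3])` also lets Spark skip the extra job that collects the distinct pivot values.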