How can I use the equivalent of pandas pivot_table() in pyspark?


I have a PySpark DataFrame that I want to pivot.

input_dataframe:

mdn    top_protocol_by_vol  top_vol  rank
55555  AAA                  30       1
55555  BBB                  20       2
55555  DDD                  10       3
9898   JJJ                  30       1
9898   CCC                  20       2
9898   FFF                  10       3
2030   PPP                  30       1
2030   KKK                  20       2
2030   FFF                  10       3

and I want to turn it into something like this:

output_dataframe:

mdn    top_protocol_by_vol_1  top_protocol_by_vol_2  top_protocol_by_vol_3  top_vol_1  top_vol_2  top_vol_3
2030   PPP                    KKK                    FFF                    30         20         10
9898   JJJ                    CCC                    FFF                    30         20         10
55555  AAA                    BBB                    DDD                    30         20         10

I know I can do something like this with pandas using the code:

output_dataframe = input_dataframe.pivot_table(
    index='mdn',
    columns=['rank'],
    aggfunc=lambda x: ''.join(x) if isinstance(x, str) else x,
    dropna=True,
).reset_index()
output_dataframe.columns = ['_'.join(str(c) for c in col if c != '')
                            for col in output_dataframe.columns.values]

How can I achieve the same result with PySpark, without converting to pandas?

CodePudding user response:

You can use the pivot function with first as the aggregate.
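For a self-contained run, here is a minimal sketch that rebuilds the input DataFrame from the question (the SparkSession setup is an assumption, not part of the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rebuild the input_dataframe shown in the question.
df = spark.createDataFrame(
    [(55555, 'AAA', 30, 1), (55555, 'BBB', 20, 2), (55555, 'DDD', 10, 3),
     (9898,  'JJJ', 30, 1), (9898,  'CCC', 20, 2), (9898,  'FFF', 10, 3),
     (2030,  'PPP', 30, 1), (2030,  'KKK', 20, 2), (2030,  'FFF', 10, 3)],
    ['mdn', 'top_protocol_by_vol', 'top_vol', 'rank'],
)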

from pyspark.sql import functions as F

# Pivot on rank; each (mdn, rank) group holds exactly one row,
# so first() simply picks up that row's value for each column.
df = (df.groupby('mdn')
      .pivot('rank')
      .agg(F.first('top_protocol_by_vol').alias('top_protocol_by_vol'),
           F.first('top_vol').alias('top_vol')))
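With multiple aggregates, Spark names the pivoted columns "<rank>_<alias>" (e.g. 1_top_protocol_by_vol), while the desired output uses "<alias>_<rank>". A minimal renaming sketch, assuming that default naming:

# Flip "<rank>_<alias>" to "<alias>_<rank>" to match the desired output.
# 'mdn' contains no underscore, so it is left untouched.
for old in df.columns:
    if '_' in old:
        rank, rest = old.split('_', 1)
        df = df.withColumnRenamed(old, f'{rest}_{rank}')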