How to get required dataframe after pyspark pivot?


I have a Spark dataframe:

+------+----------------+-------+------+---------+
|name  |           pDate| status|user  |userCount|
+------+----------------+-------+------+---------+
|janani|2022-03-04      |active |qwe123|1        |
|raj   |2022-03-03      |active |qwe123|1        |
|ram   |2022-03-01      |active |qwe123|1        |
|ram   |2022-03-02      |active |qwe123|1        |
+------+----------------+-------+------+---------+

When I pivot the dataframe with the following code:

pvtcountuser = countuser.groupby('pDate','name').pivot('pDate').max('userCount').fillna(0)

I get:

+----------------+------+----------+----------+----------+----------+
|pDate           |name  |2022-03-01|2022-03-02|2022-03-03|2022-03-04|
+----------------+------+----------+----------+----------+----------+
|2022-03-04      |janani|0         |0         |0         |1         |
|2022-03-03      |raj   |0         |0         |1         |0         |
|2022-03-01      |ram   |1         |0         |0         |0         |
|2022-03-02      |ram   |0         |1         |0         |0         |
+----------------+------+----------+----------+----------+----------+

But the required output is:

+----------------+------+----------+----------+----------+----------+
|pDate           |name  |2022-03-01|2022-03-02|2022-03-03|2022-03-04|
+----------------+------+----------+----------+----------+----------+
|2022-03-04      |janani|0         |0         |0         |1         |
|2022-03-03      |raj   |0         |0         |1         |0         |
|2022-03-01      |ram   |1         |1         |0         |0         |
+----------------+------+----------+----------+----------+----------+

Please help with this.

CodePudding user response:

The min(pDate) per name can be computed before the pivot using a window function.
Once you have it, you can group by name and min_pDate instead of the raw pDate.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

sql_stmt = '''
select *
from   values    ('janani',date '2022-03-04','active','qwe123',1)
                ,('raj'   ,date '2022-03-03','active','qwe123',1)
                ,('ram'   ,date '2022-03-01','active','qwe123',1)
                ,('ram'   ,date '2022-03-02','active','qwe123',1) as t (name,pDate,status,user,userCount)
'''

countuser = spark.sql(sql_stmt)

pvtcountuser = (countuser
                # earliest pDate per name, computed before the pivot
                .withColumn('min_pDate', F.min('pDate').over(Window.partitionBy('name')))
                # group on the stable per-name key instead of the raw pDate
                .groupby('name', 'min_pDate')
                .pivot('pDate')
                .max('userCount')
                .fillna(0)
                .withColumnRenamed('min_pDate', 'pDate')
               )

pvtcountuser.show()

+------+----------+----------+----------+----------+----------+
|  name|     pDate|2022-03-01|2022-03-02|2022-03-03|2022-03-04|
+------+----------+----------+----------+----------+----------+
|janani|2022-03-04|         0|         0|         0|         1|
|   raj|2022-03-03|         0|         0|         1|         0|
|   ram|2022-03-01|         1|         1|         0|         0|
+------+----------+----------+----------+----------+----------+
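
If you would rather avoid the window function, an alternative is to pivot on name alone and join the earliest pDate back afterwards. A minimal sketch, assuming the same countuser dataframe as above (the min_dates and pvtcountuser2 names are just illustrative):

import pyspark.sql.functions as F

# earliest pDate per name, as a plain aggregation instead of a window
min_dates = countuser.groupby('name').agg(F.min('pDate').alias('pDate'))

# pivot on name only, then attach the earliest pDate per name
pvtcountuser2 = (countuser
                 .groupby('name')
                 .pivot('pDate')
                 .max('userCount')
                 .fillna(0)
                 .join(min_dates, on='name'))

pvtcountuser2.show()

The rows match the required output; only the column order differs (pDate ends up after the pivoted date columns), which you can fix with a select if needed.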