How to get required dataframe after pyspark pivot?


I have a Spark dataframe:

+------+----------------+-------+------+---------+
|name  |           pDate| status|user  |userCount|
+------+----------------+-------+------+---------+
|janani|2022-03-04      |active |qwe123|1        |
|raj   |2022-03-03      |active |qwe123|1        |
|ram   |2022-03-01      |active |qwe123|1        |
|ram   |2022-03-02      |active |qwe123|1        |
+------+----------------+-------+------+---------+

When I pivot the dataframe with the following code:

pvtcountuser = countuser.groupby('pDate','name').pivot('pDate').max('userCount').fillna(0)

I get:

+----------------+------+----------+----------+----------+----------+
|pDate           |name  |2022-03-01|2022-03-02|2022-03-03|2022-03-04|
+----------------+------+----------+----------+----------+----------+
|2022-03-04      |janani|0         |0         |0         |1         |
|2022-03-03      |raj   |0         |0         |1         |0         |
|2022-03-01      |ram   |1         |0         |0         |0         |
|2022-03-02      |ram   |0         |1         |0         |0         |
+----------------+------+----------+----------+----------+----------+

But the required output is:

+----------------+------+----------+----------+----------+----------+
|pDate           |name  |2022-03-01|2022-03-02|2022-03-03|2022-03-04|
+----------------+------+----------+----------+----------+----------+
|2022-03-04      |janani|0         |0         |0         |1         |
|2022-03-03      |raj   |0         |0         |1         |0         |
|2022-03-01      |ram   |1         |1         |0         |0         |
+----------------+------+----------+----------+----------+----------+

Please help with this.

CodePudding user response:

The min(pDate) per name can be computed before the pivot using a window function.
Once you have it, you can group by name and min_pDate instead of the raw pDate.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

sql_stmt = '''
select *
from   values    ('janani',date '2022-03-04','active','qwe123',1)
                ,('raj'   ,date '2022-03-03','active','qwe123',1)
                ,('ram'   ,date '2022-03-01','active','qwe123',1)
                ,('ram'   ,date '2022-03-02','active','qwe123',1) as t (name,pDate,status,user,userCount)
'''

countuser = spark.sql(sql_stmt)

pvtcountuser = (countuser
                # earliest pDate per name, computed before the pivot
                .withColumn('min_pDate', F.min('pDate').over(Window.partitionBy('name')))
                # group on the stable per-name key instead of the raw pDate
                .groupby('name', 'min_pDate')
                .pivot('pDate')
                .max('userCount')
                .fillna(0)
                .withColumnRenamed('min_pDate', 'pDate')
               )

pvtcountuser.show()

+------+----------+----------+----------+----------+----------+
|  name|     pDate|2022-03-01|2022-03-02|2022-03-03|2022-03-04|
+------+----------+----------+----------+----------+----------+
|janani|2022-03-04|         0|         0|         0|         1|
|   raj|2022-03-03|         0|         0|         1|         0|
|   ram|2022-03-01|         1|         1|         0|         0|
+------+----------+----------+----------+----------+----------+
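
If you would rather avoid the window function, an alternative is to pivot on name alone and join the earliest pDate back afterwards. A minimal sketch, assuming the same countuser dataframe as above (the min_dates and pvtcountuser2 names are just illustrative):

import pyspark.sql.functions as F

# earliest pDate per name, as a plain aggregation instead of a window
min_dates = countuser.groupby('name').agg(F.min('pDate').alias('pDate'))

# pivot on name only, then attach the earliest pDate per name
pvtcountuser2 = (countuser
                 .groupby('name')
                 .pivot('pDate')
                 .max('userCount')
                 .fillna(0)
                 .join(min_dates, on='name'))

pvtcountuser2.show()

The rows match the required output; only the column order differs (pDate ends up after the pivoted date columns), which you can fix with a select if needed.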