I have a spark dataframe:
------ ---------------- ------- ------ ---------
|name | pDate| status|user |userCount|
------ ---------------- ------- ------ ---------
|janani|2022-03-04 |active |qwe123|1 |
|raj |2022-03-03 |active |qwe123|1 |
|ram |2022-03-01 |active |qwe123|1 |
|ram |2022-03-02 |active |qwe123|1 |
------ ---------------- ------- ------ ---------
when I pivoted the dataframe with the following code
pvtcountuser = countuser.groupby('pDate','name').pivot('pDate').max('userCount').fillna(0)
I get:
---------------- ------ ---------- ---------- ---------- ----------
|pDate |name |2022-03-01|2022-03-02|2022-03-03|2022-03-04|
---------------- ------ ---------- ---------- ---------- ----------
|2022-03-04 |janani|0 |0 |0 |1 |
|2022-03-03 |raj |0 |0 |1 |0 |
|2022-03-01 |ram |1 |0 |0 |0 |
|2022-03-02 |ram |0 |1 |0 |0 |
---------------- ------ ---------- ---------- ---------- ----------
but the required solution needed is:
---------------- ------ ---------- ---------- ---------- ----------
|pDate |name |2022-03-01|2022-03-02|2022-03-03|2022-03-04|
---------------- ------ ---------- ---------- ---------- ----------
|2022-03-04 |janani|0 |0 |0 |1 |
|2022-03-03 |raj |0 |0 |1 |0 |
|2022-03-01 |ram |1 |1 |0 |0 |
---------------- ------ ---------- ---------- ---------- ----------
Please help on this
CodePudding user response:
min(pDate) per name, could be achieved prior to the pivot, using windows function.
Once you have it, you can aggregate by both name and min_pDate.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
sql_stmt = '''
select *
from values ('janani',date '2022-03-04','active','qwe123',1)
,('raj' ,date '2022-03-03','active','qwe123',1)
,('ram' ,date '2022-03-01','active','qwe123',1)
,('ram' ,date '2022-03-02','active','qwe123',1) as t (name,pDate,status,user,userCount)
'''
countuser = spark.sql(sql_stmt)
pvtcountuser = (countuser
.withColumn('min_pDate', F.min('pDate').over(Window.partitionBy('name')))
.groupby('name', 'min_pDate')
.pivot('pDate')
.max('userCount')
.fillna(0)
.withColumnRenamed('min_pDate','pDate')
)
pvtcountuser.show()
------ ---------- ---------- ---------- ---------- ----------
| name| pDate|2022-03-01|2022-03-02|2022-03-03|2022-03-04|
------ ---------- ---------- ---------- ---------- ----------
|janani|2022-03-04| 0| 0| 0| 1|
| raj|2022-03-03| 0| 0| 1| 0|
| ram|2022-03-01| 1| 1| 0| 0|
------ ---------- ---------- ---------- ---------- ----------