Home > Mobile >  Create multiple columns by pivoting even when pivoted value doesn't exist
Create multiple columns by pivoting even when pivoted value doesn't exist

Time:06-29

I have a PySpark df:

Store_ID Category ID Sales
1 A 123 23
2 A 123 45
1 A 234 67
1 B 567 78
2 B 567 34
3 D 789 12
1 A 890 12

Expected:

Store_ID A_ID B_ID C_ID D_ID Sales_A Sales_B Sales_C Sales_D
1 3 1 0 0 102 78 0 0
2 1 1 0 0 45 34 0 0
3 0 0 0 1 0 0 0 12

I am able to transform this way using SQL (created a temp view):

SELECT Store_Id,
       SUM(IF(Category='A',Sales,0)) AS Sales_A,
       SUM(IF(Category='B',Sales,0)) AS Sales_B,
       SUM(IF(Category='C',Sales,0)) AS Sales_C,
       SUM(IF(Category='D',Sales,0)) AS Sales_D,
       COUNT(DISTINCT NULLIF(IF(Category='A',ID,0),0)) AS A_ID,
       COUNT(DISTINCT NULLIF(IF(Category='B',ID,0),0)) AS B_ID,
       COUNT(DISTINCT NULLIF(IF(Category='C',ID,0),0)) AS C_ID,
       COUNT(DISTINCT NULLIF(IF(Category='D',ID,0),0)) AS D_ID
FROM df
GROUP BY Store_Id;

How do we achieve the same in PySpark using native functions as its much faster?

CodePudding user response:

This operation is called pivoting.

  • a couple of aggregations, since you need both, count of ID and sum of Sales
  • alias for aggregations, for changing column names
  • providing values in pivot, for cases where you want numbers for Category C, but C doesn't exist. Providing values boosts performance too.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'A', 123, 23),
     (2, 'A', 123, 45),
     (1, 'A', 234, 67),
     (1, 'B', 567, 78),
     (2, 'B', 567, 34),
     (3, 'D', 789, 12),
     (1, 'A', 890, 12)],
    ['Store_ID', 'Category', 'ID', 'Sales'])

Script:

df = (df
    .groupBy('Store_ID')
    .pivot('Category', ['A', 'B', 'C', 'D'])
    .agg(
        F.countDistinct('ID').alias('ID'),
        F.sum('Sales').alias('Sales'))
    .fillna(0))
df.show()
#  -------- ---- ------- ---- ------- ---- ------- ---- ------- 
# |Store_ID|A_ID|A_Sales|B_ID|B_Sales|C_ID|C_Sales|D_ID|D_Sales|
#  -------- ---- ------- ---- ------- ---- ------- ---- ------- 
# |       1|   3|    102|   1|     78|   0|      0|   0|      0|
# |       3|   0|      0|   0|      0|   0|      0|   1|     12|
# |       2|   1|     45|   1|     34|   0|      0|   0|      0|
#  -------- ---- ------- ---- ------- ---- ------- ---- ------- 
  • Related