I have a PySpark DataFrame:
| Store_ID | Category | ID  | Sales |
|----------|----------|-----|-------|
| 1        | A        | 123 | 23    |
| 2        | A        | 123 | 45    |
| 1        | A        | 234 | 67    |
| 1        | B        | 567 | 78    |
| 2        | B        | 567 | 34    |
| 3        | D        | 789 | 12    |
| 1        | A        | 890 | 12    |
Expected:

| Store_ID | A_ID | B_ID | C_ID | D_ID | Sales_A | Sales_B | Sales_C | Sales_D |
|----------|------|------|------|------|---------|---------|---------|---------|
| 1        | 3    | 1    | 0    | 0    | 102     | 78      | 0       | 0       |
| 2        | 1    | 1    | 0    | 0    | 45      | 34      | 0       | 0       |
| 3        | 0    | 0    | 0    | 1    | 0       | 0       | 0       | 12      |
I am able to do this transformation using SQL (after creating a temp view):
SELECT Store_Id,
SUM(IF(Category='A',Sales,0)) AS Sales_A,
SUM(IF(Category='B',Sales,0)) AS Sales_B,
SUM(IF(Category='C',Sales,0)) AS Sales_C,
SUM(IF(Category='D',Sales,0)) AS Sales_D,
COUNT(DISTINCT NULLIF(IF(Category='A',ID,0),0)) AS A_ID,
COUNT(DISTINCT NULLIF(IF(Category='B',ID,0),0)) AS B_ID,
COUNT(DISTINCT NULLIF(IF(Category='C',ID,0),0)) AS C_ID,
COUNT(DISTINCT NULLIF(IF(Category='D',ID,0),0)) AS D_ID
FROM df
GROUP BY Store_Id;
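For reference, this is roughly how I run it, assuming the SELECT statement above is stored in a string named query (that name is just illustrative):
df.createOrReplaceTempView('df')  # view name matches the FROM clause above
result = spark.sql(query)
result.show()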
How do we achieve the same in PySpark using native functions, since it's much faster?
CodePudding user response:
This operation is called pivoting. You will need:
- a couple of aggregations, since you need both the count of ID and the sum of Sales;
- alias on the aggregations, for changing the resulting column names;
- providing the values in pivot, for cases like Category C, where you want the columns even though C doesn't exist in the data. Providing the values also boosts performance.
Input:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 'A', 123, 23),
     (2, 'A', 123, 45),
     (1, 'A', 234, 67),
     (1, 'B', 567, 78),
     (2, 'B', 567, 34),
     (3, 'D', 789, 12),
     (1, 'A', 890, 12)],
    ['Store_ID', 'Category', 'ID', 'Sales'])
Script:
df = (df
    .groupBy('Store_ID')
    .pivot('Category', ['A', 'B', 'C', 'D'])    # explicit values, so C appears even with no rows
    .agg(
        F.countDistinct('ID').alias('ID'),      # -> A_ID, B_ID, C_ID, D_ID
        F.sum('Sales').alias('Sales'))          # -> A_Sales, B_Sales, C_Sales, D_Sales
    .fillna(0))                                 # replace nulls for missing combinations with 0
df.show()
# +--------+----+-------+----+-------+----+-------+----+-------+
# |Store_ID|A_ID|A_Sales|B_ID|B_Sales|C_ID|C_Sales|D_ID|D_Sales|
# +--------+----+-------+----+-------+----+-------+----+-------+
# |       1|   3|    102|   1|     78|   0|      0|   0|      0|
# |       3|   0|      0|   0|      0|   0|      0|   1|     12|
# |       2|   1|     45|   1|     34|   0|      0|   0|      0|
# +--------+----+-------+----+-------+----+-------+----+-------+
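The pivoted columns come out as A_Sales, B_Sales, etc., while the expected output names them Sales_A, Sales_B. If you need those exact names, a small follow-up rename works; this bit is a sketch of mine, not part of the pivot itself:
# Rename <Category>_Sales to Sales_<Category> to match the expected schema
for cat in ['A', 'B', 'C', 'D']:
    df = df.withColumnRenamed(f'{cat}_Sales', f'Sales_{cat}')
df.show()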