I have sample data as shown below:
| ProductionDate | CostCenterKey | AEMainCategoryKey | AELossMagnitude |
| --- | --- | --- | --- |
| 1/1/2022 | 100030 | OAE | 84.94871412 |
| 1/1/2022 | 100030 | Overdeclared | -0.010897228 |
| 1/1/2022 | 100030 | UL | 15.06218311 |
| 1/1/2022 | 100040 | OAE | 49.99603575 |
| 1/1/2022 | 100040 | UL | 50.00001425 |
| 1/1/2022 | 100040 | Undeclared | 0.003950003 |
| 1/2/2022 | 100030 | OAE | 71.58823183 |
| 1/2/2022 | 100030 | UL | 28.36946736 |
| 1/2/2022 | 100030 | Undeclared | 0.042300804 |
| 1/2/2022 | 100040 | OAE | 49.99702425 |
| 1/2/2022 | 100040 | UL | 50.00002575 |
| 1/2/2022 | 100040 | Undeclared | 0.002950002 |
I need to transpose the AEMainCategoryKey column and need the output in the form below.
There should be one row for each combination of ProductionDate and CostCenterKey:
| ProductionDate | CostCenterKey | OAE | Overdeclared | UL | Undeclared |
| --- | --- | --- | --- | --- | --- |
| 1/1/2022 | 100030 | 84.94871412 | -0.010897228 | 15.0621831 | 0 |
| 1/1/2022 | 100040 | 49.99603575 | 0 | 50.0000143 | 0.00395 |
| 1/2/2022 | 100030 | 71.58823183 | 0 | 28.3694674 | 0.0423008 |
| 1/2/2022 | 100040 | 49.99702425 | 0 | 50.0000258 | 0.00295 |
I am writing the code below, but it is not producing the desired output.
from pyspark.sql import SparkSession
import pandas as pd

## creating a Spark Dataframe
spark_df = sqlContext.sql("select * from hive_metastore.asseteffectiveness.asset_effectiveness_maincat where productiondate in ('2022-01-01','2022-01-02') and costcenterkey in (100030,100040)")

## converting to a Pandas Dataframe
pandas_df = spark_df.toPandas()

pandas_df.pivot_table(index=['ProductionDate','CostCenterKey'], columns=['AEMainCategoryKey'], values='AELossMagnitude', fill_value=0)
display(pandas_df)
CodePudding user response:
Starting from the original format:
ProductionDate CostCenterKey AEMainCategoryKey AELossMagnitude
0 1/1/22 100030 OAE 84.948714
1 1/1/22 100030 Overdeclared -0.010897
2 1/1/22 100030 UL 15.062183
3 1/1/22 100040 OAE 49.996036
4 1/1/22 100040 UL 50.000014
5 1/1/22 100040 Undeclared 0.003950
6 1/2/22 100030 OAE 71.588232
7 1/2/22 100030 UL 28.369467
8 1/2/22 100030 Undeclared 0.042301
9 1/2/22 100040 OAE 49.997024
10 1/2/22 100040 UL 50.000026
11 1/2/22 100040 Undeclared 0.002950
Which I recreated in my code with this:
df = pd.DataFrame({'ProductionDate': ['1/1/22', '1/1/22', '1/1/22', '1/1/22',
'1/1/22', '1/1/22', '1/2/22', '1/2/22',
'1/2/22', '1/2/22', '1/2/22', '1/2/22'],
'CostCenterKey': ['100030', '100030', '100030',
'100040', '100040', '100040',
'100030', '100030', '100030',
'100040', '100040', '100040'],
'AEMainCategoryKey': ['OAE', 'Overdeclared', 'UL',
'OAE', 'UL', 'Undeclared',
'OAE', 'UL', 'Undeclared',
'OAE', 'UL', 'Undeclared',],
'AELossMagnitude': [84.94871412, -0.010897228, 15.06218311,
49.99603575, 50.00001425, 0.003950003,
71.58823183, 28.36946736, 0.042300804,
49.99702425, 50.00002575, 0.002950002]})
I have two versions.
Version 1
df3 = df.groupby(['CostCenterKey', 'ProductionDate']).first().unstack(
'ProductionDate').reset_index().dropna(axis='columns')
print(df3)
Result of version 1
CostCenterKey AEMainCategoryKey ... AELossMagnitude
ProductionDate 1/1/22 ... 1/1/22 1/2/22
0 100030 OAE ... 84.948714 71.588232
1 100040 OAE ... 49.996036 49.997024
Version 2
df2 = df.pivot(index=['CostCenterKey', 'ProductionDate'],
columns=['AEMainCategoryKey'], values=[
'AELossMagnitude']).reset_index().dropna(axis='columns')
print(df2)
Result of version 2
CostCenterKey ProductionDate AELossMagnitude
AEMainCategoryKey OAE UL
0 100030 1/1/22 84.948714 15.062183
1 100030 1/2/22 71.588232 28.369467
2 100040 1/1/22 49.996036 50.000014
3 100040 1/2/22 49.997024 50.000026
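Note that pandas_df.pivot_table(...) in the question returns a new DataFrame; it does not modify pandas_df in place, so display(pandas_df) still shows the original long-format data. If you want exactly the layout from the question, with the missing Overdeclared/Undeclared values filled with 0 instead of dropped, a minimal sketch (using the recreated df above; the variable name out is just illustrative) could look like this:

out = df.pivot_table(index=['ProductionDate', 'CostCenterKey'],
                     columns='AEMainCategoryKey',
                     values='AELossMagnitude',
                     fill_value=0).reset_index()
out.columns.name = None  # drop the leftover 'AEMainCategoryKey' columns label
print(out)

This should give one row per ProductionDate/CostCenterKey pair with OAE, Overdeclared, UL and Undeclared as plain columns, matching the desired output in the question.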