Home > Software design >  Spark/Python Dataframe - Transpose rows to columns
Spark/Python Dataframe - Transpose rows to columns

Time:11-09

I have a sample data as shown below


ProductionDate CostCenterKey AEMainCategoryKey AELossMagnitude
1/1/2022 100030 OAE 84.94871412
1/1/2022 100030 Overdeclared -0.010897228
1/1/2022 100030 UL 15.06218311
1/1/2022 100040 OAE 49.99603575
1/1/2022 100040 UL 50.00001425
1/1/2022 100040 Undeclared 0.003950003
1/2/2022 100030 OAE 71.58823183
1/2/2022 100030 UL 28.36946736
1/2/2022 100030 Undeclared 0.042300804
1/2/2022 100040 OAE 49.99702425
1/2/2022 100040 UL 50.00002575
1/2/2022 100040 Undeclared 0.002950002

I need to transpose AeMaincategoryKey column and need the output in the below form:

There should be 1 row for the combination of ProductionDate and CostCenterKey

ProductionDate CostCenterKey OAE Overdeclared UL Undeclared
1/1/2022 100030 84.94871412 -0.010897228 15.0621831 0
1/1/2022 100040 49.99603575 0 50.0000143 0.00395
1/2/2022 100030 71.58823183 0 28.3694674 0.0423008
1/2/2022 100040 49.99702425 0 50.0000258 0.00295

I am writing the below code but its not resulting the desired output.

from pyspark.sql import SparkSessionimport pandas as pd

##creating a Spark 
Dataframespark_df = sqlContext.sql("select * from hive_metastore.asseteffectiveness.asset_effectiveness_maincat where productiondate  in ('2022-01-01','2022-01-02') and costcenterkey in (100030,100040)")
##Converting to Spark Dataframepandas_df = spark_df.toPandas()
pandas_df.pivot_table(index=['ProductionDate','CostCenterKey'], columns=['AEMainCategoryKey'], values='AELossMagnitude',   fill_value=0)
display(pandas_df)

CodePudding user response:

From original format

   ProductionDate CostCenterKey AEMainCategoryKey  AELossMagnitude
0          1/1/22        100030               OAE        84.948714
1          1/1/22        100030      Overdeclared        -0.010897
2          1/1/22        100030                UL        15.062183
3          1/1/22        100040               OAE        49.996036
4          1/1/22        100040                UL        50.000014
5          1/1/22        100040        Undeclared         0.003950
6          1/2/22        100030               OAE        71.588232
7          1/2/22        100030                UL        28.369467
8          1/2/22        100030        Undeclared         0.042301
9          1/2/22        100040               OAE        49.997024
10         1/2/22        100040                UL        50.000026
11         1/2/22        100040        Undeclared         0.002950

Which I recreated in my code with this:

df = pd.DataFrame({'ProductionDate': ['1/1/22', '1/1/22', '1/1/22', '1/1/22',
                                      '1/1/22', '1/1/22', '1/2/22', '1/2/22',
                                      '1/2/22', '1/2/22', '1/2/22', '1/2/22'],
                   'CostCenterKey': ['100030', '100030', '100030',
                                     '100040', '100040', '100040',
                                     '100030', '100030', '100030',
                                     '100040', '100040', '100040'],
                   'AEMainCategoryKey': ['OAE', 'Overdeclared', 'UL',
                                         'OAE', 'UL', 'Undeclared',
                                         'OAE', 'UL', 'Undeclared',
                                         'OAE', 'UL', 'Undeclared',],
                   'AELossMagnitude': [84.94871412, -0.010897228, 15.06218311,
                                       49.99603575, 50.00001425, 0.003950003,
                                       71.58823183, 28.36946736, 0.042300804,
                                       49.99702425, 50.00002575, 0.002950002]})

I have two versions.

Version 1

df3 = df.groupby(['CostCenterKey', 'ProductionDate']).first().unstack(
    'ProductionDate').reset_index().dropna(axis='columns')
print(df3)

Result of version 1

               CostCenterKey AEMainCategoryKey  ... AELossMagnitude           
ProductionDate                          1/1/22  ...          1/1/22     1/2/22
0                     100030               OAE  ...       84.948714  71.588232
1                     100040               OAE  ...       49.996036  49.997024

Version 2

df2 = df.pivot(index=['CostCenterKey', 'ProductionDate'],
               columns=['AEMainCategoryKey'], values=[
        'AELossMagnitude']).reset_index().dropna(axis='columns')
print(df2)

Result of version 2

                  CostCenterKey ProductionDate AELossMagnitude           
AEMainCategoryKey                                          OAE         UL
0                        100030         1/1/22       84.948714  15.062183
1                        100030         1/2/22       71.588232  28.369467
2                        100040         1/1/22       49.996036  50.000014
3                        100040         1/2/22       49.997024  50.000026
  • Related