Home > Software design >  How to transpose data based on multiple lookup criterias in pandas - python
How to transpose data based on multiple lookup criterias in pandas - python

Time:12-28

I work as a payroll specialist and often encounter very weird reports where a name and id number of one employee is represented in multiple rows (column A and column B) and corresponding data is spread between many columns. Example:

id#, Name, PTO Code, Accrued Amount $, Accrued Hours, Used Amount, Used Hours, LeftAmount, LeftHours
101, Empl1, NY Sick,         0,            112,          0,         56,          0,           56
101, Empl1, Plan1, TO Am,    3600,          0,         1500,          0,         2100,        0               
101, Empl1, Plan1, PTO Hrs,   0,           240,          0,         100,           0,         140
101, Empl1, Plan2, PTO Am,   6000,          0,         6000,         0,            0,          0
101, Empl1, Plan2, PTO Hrs,    0,         400,           0,           400,          0,         0
201, Empl2, 

and so on...

This kind of report is a pain in the neck... I wrote a code (please don't laugh when you see it) that organize data. This is output:

id#, Name, NYC Sick,       NYC Sick     NYC Sick    Plan1 PTO Am  Plan1 PTOAm  Plan1 PTOAm  Plan1PTO 
           Hours Accrued  Hours Used   Hours Left      Accrued       Used         Left    HrsAccrued
101  Empl1    112              56          56            3600         1500         2100       240
201  Empl2 

and so on...

My goal's been accomplished, but it would be nice to look at some high top-notch (dry) code that would perform the same task. Please see my code below.

import pandas as pd
df = pd.read_excel('PTO report.xlsx')

creating a new data frame; Leave only ID column and Name column; get rid of duplicate values. will be an output file

 df_new = df[['ID#', 'Name']].drop_duplicates()

adding new columns

df_new[[
    'NYC Sick Hours Accrued', 'NYC Sick Hours Used', 
    'NYC Sick Hours Left', 'Plan1 PTO Amount Accrued', 
    'Plan1 PTO Amount Used', 'Plan1 PTO Amount Left', 'Plan1 PTO Hours Accrued', 
    'Plan1 PTO Hours Used', 'Plan1 PTO Hours Left', 'Plan2 PTO Amount Accrued', 
    'Plan2 PTO Amount Used', 'Plan2 PTO Amount Left', 'Plan2 PTO Hours Accrued', 
    'Plan2 PTO Hours Used', 'Plan2 PTO Hours Left']] = 0

looping through each employee and performing similar to vlookup operation. this part of code, in my view, has to be improved.

for i in df_new['ID#']:

    #NY Sick Hours Accrued ****************************************
    filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i )
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'NYC Sick Hours Accrued'] = int(df.loc[filt,'Accrued Hours'])
       
    #NY Sick Hours Used
    filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'NYC Sick Hours Used'] = int(df.loc[filt,'Used Hours'])
    
    #NY Sick Hours Left
    filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'NYC Sick Hours Left'] = int(df.loc[filt,'Ending Balance Hours'])
        
    #Plan1 PTO Amount Accrued *************************************
    filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Amount Accrued'] = int(df.loc[filt,'Accrued Amount $'])
    
    #Plan1 PTO Amount Used
    filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Amount Used'] = int(df.loc[filt,'Used Amount $'])
    
    #Plan1 PTO Amount Left 
    filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Amount Left'] = int(df.loc[filt,'Ending Balance Amount $'])
    
    
    #Plan1 PTO Hours Accrued 
    filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Hours Accrued'] = int(df.loc[filt,'Accrued Hours']) 
    
    #Plan1 PTO Hours Used
    filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Hours Used'] = int(df.loc[filt,'Used Hours']) 
    
    #Plan1 PTO Hours Left
    filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan1 PTO Hours Left'] = int(df.loc[filt,'Ending Balance Hours'])    

       
   #Plan2 PTO Amount Accrued **************************************
    filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Amount Accrued'] = int(df.loc[filt,'Accrued Amount $'])
    
    #Plan2 PTO Amount Used
    filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Amount Used'] = int(df.loc[filt,'Used Amount $'])
    
    #Plan2 PTO Amount Left 
    filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Amount Left'] = int(df.loc[filt,'Ending Balance Amount $'])
     
    #Plan2 PTO Hours Accrued 
    filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Hours Accrued'] = int(df.loc[filt,'Accrued Hours'])
    
    #Plan2 PTO Hours Used
    filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Hours Used'] = int(df.loc[filt,'Used Hours'])
    
    #Plan2 PTO Hours Left
    filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
    filt2 = df_new['ID#'] == i 
    df_new.loc[filt2, 'Plan2 PTO Hours Left'] = int(df.loc[filt,'Ending Balance Hours']) 

exporting as a new file

df_new.to_excel('PTO report transposed.xlsx', index = False)

I wrote a macro in VBA excel that does the same job. I used a class object instead of 'vlookup'. There must be a simple solution. This is my first post, so if my question is unclear or title is wrong please let me know.

Thank you for your time!

CodePudding user response:

Use DataFrame.set_index with DataFrame.stack for reshape to MultiIndex Series, then remove 0 rows with Series.unstack, last flatten MultiIndex in columns:

s = df.set_index(['id#','Name','PTO Code']).stack()
df1 = s[s.ne(0)].unstack([-2,-1])
#alternatives if duplicates
#df1 = s[s.ne(0)].groupby(level=[0,1,2,3]).sum().unstack([-2,-1])
df1.columns = df1.columns.map(lambda x: f'{x[0]} {x[1]}')
df1 = df1.reset_index()

print (df1)
   id#   Name  NY Sick Accrued Hours  NY Sick Used Hours  NY Sick LeftHours  \
0  101  Empl1                    112                  56                 56   

   Plan1 TO Am Accrued Amount $  Plan1 TO Am Used Amount  \
0                          3600                     1500   

   Plan1 TO Am LeftAmount  Plan1 PTO Hrs Accrued Hours  \
0                    2100                          240   

   Plan1 PTO Hrs Used Hours  Plan1 PTO Hrs LeftHours  \
0                       100                      140   

   Plan2 PTO Am Accrued Amount $  Plan2 PTO Am Used Amount  \
0                           6000                      6000   

   Plan2 PTO Hrs Accrued Hours  Plan2 PTO Hrs Used Hours  
0                          400                       400  

CodePudding user response:

What you want is not entirely clear, especially since you have potential errors in your columns (same there commas in the PTO column?).

Anyway, assuming PTO code is a single column, this is essentially a pivot and removing the zeros/NaNs

(df.replace(0, float('nan'))
   .pivot(index=['id#','Name'], columns='PTO Code')
   .dropna(how='all', axis=1)
 )

Output:

          Accrued Amount $              Accrued Hours                            Used Amount              Used Hours                             LeftAmount LeftHours             
PTO Code       Plan1 TO Am Plan2 PTO Am       NY Sick Plan PTO Hrs Plan2 PTO Hrs Plan1 TO Am Plan2 PTO Am    NY Sick Plan PTO Hrs Plan2 PTO Hrs Plan1 TO Am   NY Sick Plan PTO Hrs
id# Name                                                                                                                                                                          
101 Empl1           3600.0       6000.0         112.0        240.0         400.0      1500.0       6000.0       56.0        100.0         400.0      2100.0      56.0        140.0
201 Empl2              NaN          NaN           NaN          NaN           NaN         NaN          NaN        NaN          NaN           NaN         NaN       NaN          NaN
  • Related