I work as a payroll specialist and often encounter very weird reports where a name and id number of one employee is represented in multiple rows (column A and column B) and corresponding data is spread between many columns. Example:
id#, Name, PTO Code, Accrued Amount $, Accrued Hours, Used Amount, Used Hours, LeftAmount, LeftHours
101, Empl1, NY Sick, 0, 112, 0, 56, 0, 56
101, Empl1, Plan1, TO Am, 3600, 0, 1500, 0, 2100, 0
101, Empl1, Plan1, PTO Hrs, 0, 240, 0, 100, 0, 140
101, Empl1, Plan2, PTO Am, 6000, 0, 6000, 0, 0, 0
101, Empl1, Plan2, PTO Hrs, 0, 400, 0, 400, 0, 0
201, Empl2,
and so on...
This kind of report is a pain in the neck... I wrote a code (please don't laugh when you see it) that organize data. This is output:
id#, Name, NYC Sick, NYC Sick NYC Sick Plan1 PTO Am Plan1 PTOAm Plan1 PTOAm Plan1PTO
Hours Accrued Hours Used Hours Left Accrued Used Left HrsAccrued
101 Empl1 112 56 56 3600 1500 2100 240
201 Empl2
and so on...
My goal's been accomplished, but it would be nice to look at some high top-notch (dry) code that would perform the same task. Please see my code below.
import pandas as pd
df = pd.read_excel('PTO report.xlsx')
creating a new data frame; Leave only ID column and Name column; get rid of duplicate values. will be an output file
df_new = df[['ID#', 'Name']].drop_duplicates()
adding new columns
df_new[[
'NYC Sick Hours Accrued', 'NYC Sick Hours Used',
'NYC Sick Hours Left', 'Plan1 PTO Amount Accrued',
'Plan1 PTO Amount Used', 'Plan1 PTO Amount Left', 'Plan1 PTO Hours Accrued',
'Plan1 PTO Hours Used', 'Plan1 PTO Hours Left', 'Plan2 PTO Amount Accrued',
'Plan2 PTO Amount Used', 'Plan2 PTO Amount Left', 'Plan2 PTO Hours Accrued',
'Plan2 PTO Hours Used', 'Plan2 PTO Hours Left']] = 0
looping through each employee and performing similar to vlookup operation. this part of code, in my view, has to be improved.
for i in df_new['ID#']:
#NY Sick Hours Accrued ****************************************
filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i )
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'NYC Sick Hours Accrued'] = int(df.loc[filt,'Accrued Hours'])
#NY Sick Hours Used
filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'NYC Sick Hours Used'] = int(df.loc[filt,'Used Hours'])
#NY Sick Hours Left
filt = (df['PTO Code'] == 'NYC Sick Hours') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'NYC Sick Hours Left'] = int(df.loc[filt,'Ending Balance Hours'])
#Plan1 PTO Amount Accrued *************************************
filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan1 PTO Amount Accrued'] = int(df.loc[filt,'Accrued Amount $'])
#Plan1 PTO Amount Used
filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan1 PTO Amount Used'] = int(df.loc[filt,'Used Amount $'])
#Plan1 PTO Amount Left
filt = (df['PTO Code'] == 'Plan1 PTO Amount') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan1 PTO Amount Left'] = int(df.loc[filt,'Ending Balance Amount $'])
#Plan1 PTO Hours Accrued
filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan1 PTO Hours Accrued'] = int(df.loc[filt,'Accrued Hours'])
#Plan1 PTO Hours Used
filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan1 PTO Hours Used'] = int(df.loc[filt,'Used Hours'])
#Plan1 PTO Hours Left
filt = (df['PTO Code'] == 'Plan1 PTO Hours') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan1 PTO Hours Left'] = int(df.loc[filt,'Ending Balance Hours'])
#Plan2 PTO Amount Accrued **************************************
filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan2 PTO Amount Accrued'] = int(df.loc[filt,'Accrued Amount $'])
#Plan2 PTO Amount Used
filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan2 PTO Amount Used'] = int(df.loc[filt,'Used Amount $'])
#Plan2 PTO Amount Left
filt = (df['PTO Code'] == 'Plan2 PTO Amount') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan2 PTO Amount Left'] = int(df.loc[filt,'Ending Balance Amount $'])
#Plan2 PTO Hours Accrued
filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan2 PTO Hours Accrued'] = int(df.loc[filt,'Accrued Hours'])
#Plan2 PTO Hours Used
filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan2 PTO Hours Used'] = int(df.loc[filt,'Used Hours'])
#Plan2 PTO Hours Left
filt = (df['PTO Code'] == 'Plan2 PTO Hours') & (df['ID#'] == i)
filt2 = df_new['ID#'] == i
df_new.loc[filt2, 'Plan2 PTO Hours Left'] = int(df.loc[filt,'Ending Balance Hours'])
exporting as a new file
df_new.to_excel('PTO report transposed.xlsx', index = False)
I wrote a macro in VBA excel that does the same job. I used a class object instead of 'vlookup'. There must be a simple solution. This is my first post, so if my question is unclear or title is wrong please let me know.
Thank you for your time!
CodePudding user response:
Use DataFrame.set_index
with DataFrame.stack
for reshape to MultiIndex Series
, then remove 0
rows with Series.unstack
, last flatten MultiIndex in columns
:
s = df.set_index(['id#','Name','PTO Code']).stack()
df1 = s[s.ne(0)].unstack([-2,-1])
#alternatives if duplicates
#df1 = s[s.ne(0)].groupby(level=[0,1,2,3]).sum().unstack([-2,-1])
df1.columns = df1.columns.map(lambda x: f'{x[0]} {x[1]}')
df1 = df1.reset_index()
print (df1)
id# Name NY Sick Accrued Hours NY Sick Used Hours NY Sick LeftHours \
0 101 Empl1 112 56 56
Plan1 TO Am Accrued Amount $ Plan1 TO Am Used Amount \
0 3600 1500
Plan1 TO Am LeftAmount Plan1 PTO Hrs Accrued Hours \
0 2100 240
Plan1 PTO Hrs Used Hours Plan1 PTO Hrs LeftHours \
0 100 140
Plan2 PTO Am Accrued Amount $ Plan2 PTO Am Used Amount \
0 6000 6000
Plan2 PTO Hrs Accrued Hours Plan2 PTO Hrs Used Hours
0 400 400
CodePudding user response:
What you want is not entirely clear, especially since you have potential errors in your columns (same there commas in the PTO column?).
Anyway, assuming PTO code is a single column, this is essentially a pivot
and removing the zeros/NaNs
(df.replace(0, float('nan'))
.pivot(index=['id#','Name'], columns='PTO Code')
.dropna(how='all', axis=1)
)
Output:
Accrued Amount $ Accrued Hours Used Amount Used Hours LeftAmount LeftHours
PTO Code Plan1 TO Am Plan2 PTO Am NY Sick Plan PTO Hrs Plan2 PTO Hrs Plan1 TO Am Plan2 PTO Am NY Sick Plan PTO Hrs Plan2 PTO Hrs Plan1 TO Am NY Sick Plan PTO Hrs
id# Name
101 Empl1 3600.0 6000.0 112.0 240.0 400.0 1500.0 6000.0 56.0 100.0 400.0 2100.0 56.0 140.0
201 Empl2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN