Home > Software engineering >  Add only available date values to monthly dataframe
Add only available date values to monthly dataframe

Time:11-09

I have a DataFrame with employee time sheets, total hour worked as well as overtime hours.

example = pd.DataFrame({'Employee': ["Alex", "Alex", "Alex", "Bob", "Peter", "Peter"], 'date': ['2021-01-01', '2021-01-01', '2021-01-03', '2021-01-02', '2021-01-01', '2021-01-02'],
                       'Total Hour': [1.5, 2.2, 7, 1, 3, 6], 'Overtime': [1.5, 0, 1.2, 2.3, 1.7, 5]})
print(example)

  Employee        date  Total Hour  Overtime
0     Alex  2021-01-01         1.5       1.5
1     Alex  2021-01-01         2.2       0.0
2     Alex  2021-01-03         7.0       1.2
3      Bob  2021-01-02         1.0       2.3
4    Peter  2021-01-01         3.0       1.7
5    Peter  2021-01-02         6.0       5.0

I want to create a monthly DataFrame with all the days of the month and fill only available timesheets in it, something like this:

                          2021-01-01  2021-01-02  2021-01-03  2021-01-04  2021-01-05  ... 2021-01-31
  Employee           
0     Alex   Total Hour       3.7                     7.0
1     Alex   Overtime         1.5                     1.2
2     Bob    Total Hour                   1 
3     Bob    Overtime                     2.3
4     Peter  Total Hour       3.0         6
5     Peter  Overtime         1.7         5

So as employees enter their hours, this will become more complete.

I tried to figure it out but I think I am missing something very fundamental here.

CodePudding user response:

Let's use pivot_table then stack level 0 to convert column MultiIndex to row MultiIndex:

result_df = example.pivot_table(
    index='Employee',
    columns='date',
    values=['Total Hour', 'Overtime'],
    aggfunc='sum'
).stack(level=0)

Or equivalently with groupby sum then stack unstack to swap column and row MultiIndexes:

result_df = example.groupby(['Employee', 'date']).sum().stack().unstack(level=1)

result_df:

date                 2021-01-01  2021-01-02  2021-01-03
Employee                                               
Alex     Overtime           1.5         NaN         1.2
         Total Hour         3.7         NaN         7.0
Bob      Overtime           NaN         2.3         NaN
         Total Hour         NaN         1.0         NaN
Peter    Overtime           1.7         5.0         NaN
         Total Hour         3.0         6.0         NaN

Either approach can be followed with reindex to order level 1 so that Overtime appears first and Total Hour appears second per Employee. Then rename_axis and reset_index for some clean up of index and column labels:

result_df = result_df.reindex(
    ['Overtime', 'Total Hour'], level=1
).rename_axis(
    index=['Employee', 'Hours'], columns=None
).reset_index()

result_df:

  Employee       Hours  2021-01-01  2021-01-02  2021-01-03
0     Alex    Overtime         1.5         NaN         1.2
1     Alex  Total Hour         3.7         NaN         7.0
2      Bob    Overtime         NaN         2.3         NaN
3      Bob  Total Hour         NaN         1.0         NaN
4    Peter    Overtime         1.7         5.0         NaN
5    Peter  Total Hour         3.0         6.0         NaN
  • Related