Home > Back-end >  Sum of year till data in a multiindex dataframe
Sum of year till data in a multiindex dataframe

Time:10-12

I have a multiindex dataframe

df

                        2021-06-28                                                                         2021-07-05      
                        total_orders    total_sales     last_year_sales     last_year_total_orders         total_orders     total_sales     last_year_sales     last_year_total_orders
group       category                                
A           Amazon      195.000         1,268.850       5,194.950           195.000                         ...            ...              ...                 ...
            Netflix     37.000          15,323.800      11,461.250          20.00
            Apple       61.000          18,274.385      19,102.220          30.00   
            Facebook    106.000         19,722.650      22,796.420          50.00
            Tesla       13.000          1,688.675       992.600             25.00
            Uber        4.000           1,906.000       3,671.150           15.00
            Google      61.000          55,547.255      30,853.115          10.00
            total       477.000         113,731.615     94,071.705          56.00
B           Amazon      50.000          3,219.650       6,022.300           400.00
            Netflix     17.000          5,918.500       4,749.000           30.00
            Apple       50.000          15,852.060      7,373.600           27.00
            Facebook    75.000          17,743.700      33,514.000          15.00
            Tesla       14.000          1,708.750       902.010             66.00
            Uber        3.000           937.010         349.300             94.00
            Google      43.000          37,795.150      35,943.450          65.00
            total       252.000         83,174.820      88,853.660          61.00

I am trying to create a year_to_date column where the sum would be the sum of total_sales starting from 2021-06-28 till today. As you can see above I have 4 columns per week and I have weeks starting from 2020-06-29 till this week 2021-10-11. So each week starting from 2021-06-28 I would have a column that would simply hold the cumulative sum of sales from all of the previous weeks starting at 2021-06-28.

I would like the result to look like so ( I removed some columns for better visual ):

                        2021-06-28                              2021-07-05                                                              2021-07-12
                        total_orders    total_sales             total_orders     total_sales            year_to_date_sales              total_orders     total_sales            year_to_date_sales     
group       category                                
A           Amazon      195.000         1,268.850                ...             1000                   1,268.850   1000                ...              5000                   1,268.850   1000   5000
            Netflix     37.000          15,323.800     
            Apple       61.000          18,274.385               ...             2000                   15,323.800   2000               ...              6000                   1,268.850   1000   6000
            Facebook    106.000         19,722.650     
            Tesla       13.000          1,688.675      
            Uber        4.000           1,906.000      
            Google      61.000          55,547.255     
            total       477.000         113,731.615    
B           Amazon      50.000          3,219.650      
            Netflix     17.000          5,918.500      
            Apple       50.000          15,852.060     
            Facebook    75.000          17,743.700     
            Tesla       14.000          1,708.750      
            Uber        3.000           937.010        
            Google      43.000          37,795.150     
            total       252.000         83,174.820     

I tried:

# Adding year to date sales

s = df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')] 
s = np.sum(s, axis = 1)
s = s.rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1)
df = df.combine_first(s)

# I tried ['2021-06-28':'2021-10-11'] to select all columns from - to but it does not let me

And I assume I would use np.sum(s, axis = 1) to sum it row-wise so I would have the result for each group and category. But currently, I am unable to select the whole range of interest and my approach would create the same values for each week rather than each next week from 2021-06-28 having a cumulative sum.

How can I achieve such a result?

UPDATE

After using

df1 = (df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')]
        .rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1))

df = df.join(df1.cumsum(axis=1)).sort_index(axis=1)

I do not get the new columns added for some reason:

df.loc[:,'2021-07-05'].columns


MultiIndex([('2021-07-05',       'total_orders'),
            ('2021-07-05',        'total_sales'),
            ('2021-07-05',        'last_year_sales'),
            ('2021-07-05',        'last_year_total_orders')]
           names=['created_at', None])

And when I check what df1 holds its only one column for 2021-06-28, I expect multiple columns for each week. The new column in the final df only exists at 2021-06-28 but not at other future dates.

df data

df = pd.DataFrame.from_dict({('group', ''): {0: 'A',
  1: 'A',
  2: 'A',
  3: 'A',
  4: 'A',
  5: 'A',
  6: 'A',
  7: 'A',
  8: 'A',
  9: 'B',
  10: 'B',
  11: 'B',
  12: 'B',
  13: 'B',
  14: 'B',
  15: 'B',
  16: 'B',
  17: 'B',
  18: 'all',
  19: 'all'},
 ('category', ''): {0: 'Amazon',
  1: 'Apple',
  2: 'Facebook',
  3: 'Google',
  4: 'Netflix',
  5: 'Tesla',
  6: 'Total',
  7: 'Uber',
  8: 'total',
  9: 'Amazon',
  10: 'Apple',
  11: 'Facebook',
  12: 'Google',
  13: 'Netflix',
  14: 'Tesla',
  15: 'Total',
  16: 'Uber',
  17: 'total',
  18: 'Total',
  19: 'total'},
 (pd.Timestamp('2020-06-29 00:00:00'), 'last_sales'): {0: 195.0,
  1: 61.0,
  2: 106.0,
  3: 61.0,
  4: 37.0,
  5: 13.0,
  6: 954.0,
  7: 4.0,
  8: 477.0,
  9: 50.0,
  10: 50.0,
  11: 75.0,
  12: 43.0,
  13: 17.0,
  14: 14.0,
  15: 504.0,
  16: 3.0,
  17: 252.0,
  18: 2916.0,
  19: 2916.0},
 (pd.Timestamp('2020-06-29 00:00:00'), 'total_sales'): {0: 1268.85,
  1: 18274.385000000002,
  2: 19722.65,
  3: 55547.255,
  4: 15323.800000000001,
  5: 1688.6749999999997,
  6: 227463.23,
  7: 1906.0,
  8: 113731.615,
  9: 3219.6499999999996,
  10: 15852.060000000001,
  11: 17743.7,
  12: 37795.15,
  13: 5918.5,
  14: 1708.75,
  15: 166349.64,
  16: 937.01,
  17: 83174.82,
  18: 787625.7400000001,
  19: 787625.7400000001},
 (pd.Timestamp('2020-06-29 00:00:00'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2020-07-06 00:00:00'), 'last_sales'): {0: 26.0,
  1: 39.0,
  2: 79.0,
  3: 49.0,
  4: 10.0,
  5: 10.0,
  6: 436.0,
  7: 5.0,
  8: 218.0,
  9: 89.0,
  10: 34.0,
  11: 133.0,
  12: 66.0,
  13: 21.0,
  14: 20.0,
  15: 732.0,
  16: 3.0,
  17: 366.0,
  18: 2336.0,
  19: 2336.0},
 (pd.Timestamp('2020-07-06 00:00:00'), 'total_sales'): {0: 3978.15,
  1: 12138.96,
  2: 19084.175,
  3: 40033.46000000001,
  4: 4280.15,
  5: 1495.1,
  6: 165548.29,
  7: 1764.15,
  8: 82774.145,
  9: 8314.92,
  10: 12776.649999999996,
  11: 28048.075,
  12: 55104.21000000002,
  13: 6962.844999999999,
  14: 3053.2000000000003,
  15: 231049.11000000002,
  16: 1264.655,
  17: 115524.55500000001,
  18: 793194.8000000002,
  19: 793194.8000000002},
 (pd.Timestamp('2020-07-06 00:00:00'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2021-06-28 00:00:00'), 'last_sales'): {0: 96.0,
  1: 56.0,
  2: 106.0,
  3: 44.0,
  4: 34.0,
  5: 13.0,
  6: 716.0,
  7: 9.0,
  8: 358.0,
  9: 101.0,
  10: 22.0,
  11: 120.0,
  12: 40.0,
  13: 13.0,
  14: 8.0,
  15: 610.0,
  16: 1.0,
  17: 305.0,
  18: 2652.0,
  19: 2652.0},
 (pd.Timestamp('2021-06-28 00:00:00'), 'total_sales'): {0: 5194.95,
  1: 19102.219999999994,
  2: 22796.420000000002,
  3: 30853.115,
  4: 11461.25,
  5: 992.6,
  6: 188143.41,
  7: 3671.15,
  8: 94071.705,
  9: 6022.299999999998,
  10: 7373.6,
  11: 33514.0,
  12: 35943.45,
  13: 4749.000000000001,
  14: 902.01,
  15: 177707.32,
  16: 349.3,
  17: 88853.66,
  18: 731701.46,
  19: 731701.46},
 (pd.Timestamp('2021-06-28 00:00:00'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2021-07-07 00:00:00'), 'last_sales'): {0: 45.0,
  1: 47.0,
  2: 87.0,
  3: 45.0,
  4: 13.0,
  5: 8.0,
  6: 494.0,
  7: 2.0,
  8: 247.0,
  9: 81.0,
  10: 36.0,
  11: 143.0,
  12: 56.0,
  13: 9.0,
  14: 9.0,
  15: 670.0,
  16: 1.0,
  17: 335.0,
  18: 2328.0,
  19: 2328.0},
 (pd.Timestamp('2021-07-07 00:00:00'), 'total_sales'): {0: 7556.414999999998,
  1: 14985.05,
  2: 16790.899999999998,
  3: 36202.729999999996,
  4: 4024.97,
  5: 1034.45,
  6: 163960.32999999996,
  7: 1385.65,
  8: 81980.16499999998,
  9: 5600.544999999999,
  10: 11209.92,
  11: 32832.61,
  12: 42137.44500000001,
  13: 3885.1499999999996,
  14: 1191.5,
  15: 194912.34000000003,
  16: 599.0,
  17: 97456.17000000001,
  18: 717745.3400000001,
  19: 717745.3400000001},
 (pd.Timestamp('2021-07-07 00:00:00'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0}}).set_index(['group','category'])

CodePudding user response:

Use DataFrame.cumsum with axis=1, add to original and sorting columns in MultiIndex:

idx = pd.IndexSlice
df1 = (df.loc[:, idx['2021-06-28':'2021-10-11', 'total_sales']]
        .rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1))

df = df.join(df1.cumsum(axis=1)).sort_index(axis=1)
  • Related