I have a MultiIndex DataFrame:
df
2021-06-28 2021-07-05
total_orders total_sales last_year_sales last_year_total_orders total_orders total_sales last_year_sales last_year_total_orders
group category
A Amazon 195.000 1,268.850 5,194.950 195.000 ... ... ... ...
Netflix 37.000 15,323.800 11,461.250 20.00
Apple 61.000 18,274.385 19,102.220 30.00
Facebook 106.000 19,722.650 22,796.420 50.00
Tesla 13.000 1,688.675 992.600 25.00
Uber 4.000 1,906.000 3,671.150 15.00
Google 61.000 55,547.255 30,853.115 10.00
total 477.000 113,731.615 94,071.705 56.00
B Amazon 50.000 3,219.650 6,022.300 400.00
Netflix 17.000 5,918.500 4,749.000 30.00
Apple 50.000 15,852.060 7,373.600 27.00
Facebook 75.000 17,743.700 33,514.000 15.00
Tesla 14.000 1,708.750 902.010 66.00
Uber 3.000 937.010 349.300 94.00
Google 43.000 37,795.150 35,943.450 65.00
total 252.000 83,174.820 88,853.660 61.00
I am trying to create a year_to_date column that holds the sum of total_sales from 2021-06-28 up to today. As you can see above, I have 4 columns per week, and the weeks run from 2020-06-29 up to this week, 2021-10-11. So for each week starting from 2021-06-28 I would like a column that simply holds the cumulative sum of sales over all the previous weeks, starting at 2021-06-28.
I would like the result to look like this (I removed some columns for better readability):
2021-06-28 2021-07-05 2021-07-12
total_orders total_sales total_orders total_sales year_to_date_sales total_orders total_sales year_to_date_sales
group category
A Amazon 195.000 1,268.850 ... 1000 1,268.850 1000 ... 5000 1,268.850 1000 5000
Netflix 37.000 15,323.800
Apple 61.000 18,274.385 ... 2000 15,323.800 2000 ... 6000 1,268.850 1000 6000
Facebook 106.000 19,722.650
Tesla 13.000 1,688.675
Uber 4.000 1,906.000
Google 61.000 55,547.255
total 477.000 113,731.615
B Amazon 50.000 3,219.650
Netflix 17.000 5,918.500
Apple 50.000 15,852.060
Facebook 75.000 17,743.700
Tesla 14.000 1,708.750
Uber 3.000 937.010
Google 43.000 37,795.150
total 252.000 83,174.820
I tried:
# Adding year to date sales
s = df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')]
s = np.sum(s, axis = 1)
s = s.rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1)
df = df.combine_first(s)
# I tried ['2021-06-28':'2021-10-11'] to select all columns from - to but it does not let me
I assume I would use np.sum(s, axis=1) to sum row-wise, so that I get a result for each group and category. But currently I am unable to select the whole range of interest, and my approach would produce the same value for every week rather than a cumulative sum that grows with each successive week from 2021-06-28.
How can I achieve such a result?
UPDATE
After using
df1 = (df.loc[:, (['2021-06-28','2021-10-11'], 'total_sales')]
         .rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1))
df = df.join(df1.cumsum(axis=1)).sort_index(axis=1)
I do not get the new columns added for some reason:
df.loc[:,'2021-07-05'].columns
MultiIndex([('2021-07-05', 'total_orders'),
            ('2021-07-05', 'total_sales'),
            ('2021-07-05', 'last_year_sales'),
            ('2021-07-05', 'last_year_total_orders')],
           names=['created_at', None])
And when I check what df1 holds, it contains only one column, for 2021-06-28; I expect a column for each week. The new column in the final df also only exists at 2021-06-28, not at the later dates.
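For reference, a list selection like (['2021-06-28','2021-10-11'], 'total_sales') only matches those two exact labels rather than the range between them, so it can only pick up week labels that actually exist in the columns. The available week labels can be listed with:
# list the week labels present in the top level of the column MultiIndex
print(df.columns.get_level_values(0).unique())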
df data
df = pd.DataFrame.from_dict({('group', ''): {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'A',
6: 'A',
7: 'A',
8: 'A',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'B',
14: 'B',
15: 'B',
16: 'B',
17: 'B',
18: 'all',
19: 'all'},
('category', ''): {0: 'Amazon',
1: 'Apple',
2: 'Facebook',
3: 'Google',
4: 'Netflix',
5: 'Tesla',
6: 'Total',
7: 'Uber',
8: 'total',
9: 'Amazon',
10: 'Apple',
11: 'Facebook',
12: 'Google',
13: 'Netflix',
14: 'Tesla',
15: 'Total',
16: 'Uber',
17: 'total',
18: 'Total',
19: 'total'},
(pd.Timestamp('2020-06-29 00:00:00'), 'last_sales'): {0: 195.0,
1: 61.0,
2: 106.0,
3: 61.0,
4: 37.0,
5: 13.0,
6: 954.0,
7: 4.0,
8: 477.0,
9: 50.0,
10: 50.0,
11: 75.0,
12: 43.0,
13: 17.0,
14: 14.0,
15: 504.0,
16: 3.0,
17: 252.0,
18: 2916.0,
19: 2916.0},
(pd.Timestamp('2020-06-29 00:00:00'), 'total_sales'): {0: 1268.85,
1: 18274.385000000002,
2: 19722.65,
3: 55547.255,
4: 15323.800000000001,
5: 1688.6749999999997,
6: 227463.23,
7: 1906.0,
8: 113731.615,
9: 3219.6499999999996,
10: 15852.060000000001,
11: 17743.7,
12: 37795.15,
13: 5918.5,
14: 1708.75,
15: 166349.64,
16: 937.01,
17: 83174.82,
18: 787625.7400000001,
19: 787625.7400000001},
(pd.Timestamp('2020-06-29 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2020-07-06 00:00:00'), 'last_sales'): {0: 26.0,
1: 39.0,
2: 79.0,
3: 49.0,
4: 10.0,
5: 10.0,
6: 436.0,
7: 5.0,
8: 218.0,
9: 89.0,
10: 34.0,
11: 133.0,
12: 66.0,
13: 21.0,
14: 20.0,
15: 732.0,
16: 3.0,
17: 366.0,
18: 2336.0,
19: 2336.0},
(pd.Timestamp('2020-07-06 00:00:00'), 'total_sales'): {0: 3978.15,
1: 12138.96,
2: 19084.175,
3: 40033.46000000001,
4: 4280.15,
5: 1495.1,
6: 165548.29,
7: 1764.15,
8: 82774.145,
9: 8314.92,
10: 12776.649999999996,
11: 28048.075,
12: 55104.21000000002,
13: 6962.844999999999,
14: 3053.2000000000003,
15: 231049.11000000002,
16: 1264.655,
17: 115524.55500000001,
18: 793194.8000000002,
19: 793194.8000000002},
(pd.Timestamp('2020-07-06 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-06-28 00:00:00'), 'last_sales'): {0: 96.0,
1: 56.0,
2: 106.0,
3: 44.0,
4: 34.0,
5: 13.0,
6: 716.0,
7: 9.0,
8: 358.0,
9: 101.0,
10: 22.0,
11: 120.0,
12: 40.0,
13: 13.0,
14: 8.0,
15: 610.0,
16: 1.0,
17: 305.0,
18: 2652.0,
19: 2652.0},
(pd.Timestamp('2021-06-28 00:00:00'), 'total_sales'): {0: 5194.95,
1: 19102.219999999994,
2: 22796.420000000002,
3: 30853.115,
4: 11461.25,
5: 992.6,
6: 188143.41,
7: 3671.15,
8: 94071.705,
9: 6022.299999999998,
10: 7373.6,
11: 33514.0,
12: 35943.45,
13: 4749.000000000001,
14: 902.01,
15: 177707.32,
16: 349.3,
17: 88853.66,
18: 731701.46,
19: 731701.46},
(pd.Timestamp('2021-06-28 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-07-07 00:00:00'), 'last_sales'): {0: 45.0,
1: 47.0,
2: 87.0,
3: 45.0,
4: 13.0,
5: 8.0,
6: 494.0,
7: 2.0,
8: 247.0,
9: 81.0,
10: 36.0,
11: 143.0,
12: 56.0,
13: 9.0,
14: 9.0,
15: 670.0,
16: 1.0,
17: 335.0,
18: 2328.0,
19: 2328.0},
(pd.Timestamp('2021-07-07 00:00:00'), 'total_sales'): {0: 7556.414999999998,
1: 14985.05,
2: 16790.899999999998,
3: 36202.729999999996,
4: 4024.97,
5: 1034.45,
6: 163960.32999999996,
7: 1385.65,
8: 81980.16499999998,
9: 5600.544999999999,
10: 11209.92,
11: 32832.61,
12: 42137.44500000001,
13: 3885.1499999999996,
14: 1191.5,
15: 194912.34000000003,
16: 599.0,
17: 97456.17000000001,
18: 717745.3400000001,
19: 717745.3400000001},
(pd.Timestamp('2021-07-07 00:00:00'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0}}).set_index(['group','category'])
CodePudding user response:
Use DataFrame.cumsum with axis=1, join the result back to the original, and sort the columns of the MultiIndex:
idx = pd.IndexSlice
# select the total_sales column of every week from 2021-06-28 to 2021-10-11
df1 = (df.loc[:, idx['2021-06-28':'2021-10-11', 'total_sales']]
         .rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1))
# cumulative sum across the weeks, then join back and regroup the columns per week
df = df.join(df1.cumsum(axis=1)).sort_index(axis=1)
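Below is a minimal, self-contained sketch of the same pattern on made-up weeks and numbers (the names toy and ytd and all values are illustrative only, not the data from the question). It shows that every week inside the sliced range ends up with its own year_to_date_sales column:
import pandas as pd

# toy frame with two column levels: week / metric (illustrative values)
cols = pd.MultiIndex.from_product(
    [pd.to_datetime(['2021-06-28', '2021-07-05', '2021-07-12']),
     ['total_orders', 'total_sales']])
toy = pd.DataFrame([[1, 10.0, 2, 20.0, 3, 30.0],
                    [4, 40.0, 5, 50.0, 6, 60.0]],
                   index=['Amazon', 'Netflix'], columns=cols)

idx = pd.IndexSlice
ytd = (toy.loc[:, idx['2021-06-28':'2021-07-12', 'total_sales']]
          .rename({'total_sales': 'year_to_date_sales'}, axis=1, level=1)
          .cumsum(axis=1))              # running total across the weeks
toy = toy.join(ytd).sort_index(axis=1)

# each week from 2021-06-28 on now also carries a year_to_date_sales column
print(toy.loc[:, '2021-07-12'])
The string slice on the date level assumes the week columns are in chronological order; if they are not, sorting the columns first with sort_index(axis=1) avoids slicing errors.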