I am able to get the results I need, but I'm curious whether this can be done more efficiently (as in fewer steps) or with a built-in function.
As mockup data, I have 3 years of daily widget sales. The goal is to compare totals for same-quarter sales in the previous year: Q1 2019 compared to Q1 2020, Q1 2020 to Q1 2021, Q2 2019 to Q2 2020, etc.
My strategy was:
- Resample/downsample the daily sales data into quarters.
- Use a set of conditions to create a new column that labels each row with its quarter (Q1, Q2, Q3, Q4).
- Create a column that holds only the year.
- Create a new df with a groupby that produces a MultiIndex df grouping each quarter by year, so all Q1s are together and listed by year.
- Do another groupby that computes the difference between like quarters year by year.
Here is the code, starting from the already resampled/downsampled data:
sample_data = [{'date': Timestamp('2019-03-31 00:00:00'), '#_widgets': 4502},
{'date': Timestamp('2019-06-30 00:00:00'), '#_widgets': 4388},
{'date': Timestamp('2019-09-30 00:00:00'), '#_widgets': 4703},
{'date': Timestamp('2019-12-31 00:00:00'), '#_widgets': 4344},
{'date': Timestamp('2020-03-31 00:00:00'), '#_widgets': 4206},
{'date': Timestamp('2020-06-30 00:00:00'), '#_widgets': 4281},
{'date': Timestamp('2020-09-30 00:00:00'), '#_widgets': 4879},
{'date': Timestamp('2020-12-31 00:00:00'), '#_widgets': 4115},
{'date': Timestamp('2021-03-31 00:00:00'), '#_widgets': 4313},
{'date': Timestamp('2021-06-30 00:00:00'), '#_widgets': 4578},
{'date': Timestamp('2021-09-30 00:00:00'), '#_widgets': 4467},
{'date': Timestamp('2021-12-31 00:00:00'), '#_widgets': 4871}]
Import libraries:
import pandas as pd
from datetime import datetime
import numpy as np
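One step that was implicit above: dfq is just the sample records loaded into a DataFrame (Timestamp in the records is pandas.Timestamp, so the date column already comes out as datetime):
## build the quarterly frame from the sample records above
dfq = pd.DataFrame(sample_data)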
Create conditions to add Q1, Q2, etc., using the fact that Q1 rows have month 3, Q2 rows month 6, and so on:
abbrv = dfq["date"].dt.month
conditions = {
"Q1": abbrv == 3,
"Q2": abbrv == 6,
"Q3": abbrv == 9,
"Q4": abbrv == 12
}
Then I passed the conditions:
## use numpy to run conditions
dfq['quarter'] = np.select(conditions.values(), conditions.keys(), default="Not Available")
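As a quick sanity check (using the sample data above), the first few rows of dfq should now look something like this:
## peek at the new quarter column
print(dfq[["date", "quarter"]].head(4))
##         date quarter
## 0 2019-03-31      Q1
## 1 2019-06-30      Q2
## 2 2019-09-30      Q3
## 3 2019-12-31      Q4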
Add a year column so I can groupby:
## add years to it
dfq["year"] = dfq["date"].dt.year
Create a new df using groupby:
## group by to get sort by quarters in each year
dff = dfq.groupby(["quarter", "year"])["#_widgets"].sum().to_frame()
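dff is now a MultiIndex frame keyed by (quarter, year) with one summed #_widgets value per pair; for example, the Q1 slice of the sample data looks roughly like:
## inspect one quarter across the years
print(dff.loc["Q1"])
##       #_widgets
## year
## 2019       4502
## 2020       4206
## 2021       4313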
In my final step, I groupby again to get the difference for same-period quarters:
## find the difference for previous year quarter comparisons
dff['diff'] = dff.groupby(['quarter']).diff()
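For the sample data, the Q1 rows then come out roughly as below (the NaN is the first year, which has no previous quarter to compare against):
## quick look at the first quarter's year-over-year comparison
print(dff.head(3))
##               #_widgets   diff
## quarter year
## Q1      2019       4502    NaN
##         2020       4206 -296.0
##         2021       4313  107.0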
This works fine, but it seems like there has to be a way to do it in fewer steps without adding too much complexity.
CodePudding user response:
Your approach looks fine to me. Having said that, you can certainly make it more concise by using the built-in pandas method to_period to convert the datetime directly to a quarterly period.
d = df['date'].dt.to_period('Q').dt
df['year'], df['quarter'] = d.year, d.quarter
dff = df.groupby(['quarter', 'year'], as_index=False)['#_widgets'].sum()
dff['diff'] = dff.groupby('quarter')['#_widgets'].diff()
Result
quarter year #_widgets diff
0 1 2019 4502 NaN
1 1 2020 4206 -296.0
2 1 2021 4313 107.0
3 2 2019 4388 NaN
4 2 2020 4281 -107.0
5 2 2021 4578 297.0
6 3 2019 4703 NaN
7 3 2020 4879 176.0
8 3 2021 4467 -412.0
9 4 2019 4344 NaN
10 4 2020 4115 -229.0
11 4 2021 4871 756.0
Here is an alternative approach with a different output representation, in case you are interested in trying it :-)
d = df['date'].dt.to_period('Q').dt
df['year'], df['quarter'] = d.year, d.quarter
s = df.pivot_table('#_widgets', 'quarter', 'year', aggfunc='sum')
pd.concat([s, s.diff(axis=1)], axis=1, keys=['#_widgets', 'diff'])
Result
#_widgets diff
year 2019 2020 2021 2019 2020 2021
quarter
1 4502 4206 4313 NaN -296 107
2 4388 4281 4578 NaN -107 297
3 4703 4879 4467 NaN 176 -412
4 4344 4115 4871 NaN -229 756
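If you keep that wide frame around (say as out, which is just my name for the pd.concat result above), single comparisons are easy to pull out of the column MultiIndex, e.g.:
## store the wide result and pull out one year-over-year difference
out = pd.concat([s, s.diff(axis=1)], axis=1, keys=['#_widgets', 'diff'])
print(out.loc[1, ('diff', 2020)])   # -296.0, i.e. Q1 2020 vs Q1 2019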