I am trying to divide line items with a start and end date into multiple rows based on months. Values should be calculated based on number of days in the specific months.
For instance, data of 1 line item:
id | StartDate | EndDate | Annual |
---|---|---|---|
abc | 12/12/2018 | 01/12/2019 | 120,450 |
expected output:
id | Month | Year | Monthly volume |
---|---|---|---|
abc | 12 | 2018 | 6,600 |
abc | 1 | 2019 | 10,230 |
abc | 2 | 2019 | 9,240 |
abc | 3 | 2019 | 10,230 |
abc | 4 | 2019 | 9,900 |
abc | 5 | 2019 | 10,230 |
abc | 6 | 2019 | 9,900 |
abc | 7 | 2019 | 10,230 |
abc | 8 | 2019 | 10,230 |
abc | 9 | 2019 | 9,900 |
abc | 10 | 2019 | 10,230 |
abc | 11 | 2019 | 9,900 |
CodePudding user response:
Few things for next time you ask.
- This is a case where there are existing answers, so always try google first to reduce duplication. Other post is referenced in code below.
- You should also always include the code you already tried, SO doesn't like to do your homework, but we will help you with it.
- You should include a more readily reproduced dataframe. I shouldn't have to copy paste to build it, as in below code.
- you are clearly doing something to convert the Annual total to a monthly volume but you do not explain this, so do not expect it to be done for you.
- Lastly, this code doesn't convert to separate month and year columns, but once you have the date, this should be trivial for you to do (or to google how to do).
import pandas as pd
df = pd.DataFrame(
data = [['abc','12/12/2018','12/01/2019',120450]],
columns = ['id', 'startDate', 'EndDate', 'Annual']
)
df['startDate'] = pd.to_datetime(df['startDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
# pd.bdate_range(start="2020/12/16", end="2020/12/26", freq="C", weekmask="Sat Sun")
# %%
df_start_end = df.melt(id_vars=['id', 'Annual'],value_name='date')
# credit to u/gen
# https://stackoverflow.com/questions/42151886/expanding-pandas-data-frame-with-date-range-in-columns
df = (
df_start_end.groupby('id')
.apply(lambda x: x.set_index('date')
.resample('M').pad())
.drop(columns=['id','variable'])
.reset_index()
)
print(df)