How do I allocate grand totals of years of data into each year in Pandas?-CodePudding

I have this original dataset in Pandas dataframe:

Program	Start year	Duration	Total amount
a	2021	2	$30000
a	2022	3	$3000
b	2020	2.5	$15000

Want to rearrange into the following dataframe:

Program	Year	Annual amount
a	2021	$15,000
a	2022	$16,000
a	2023	$1,000
a	2024	$1,000
b	2020	$6,000
b	2021	$6,000
b	2022	$3,000

CodePudding user response：

So, given the following dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "Program": {0: "a", 1: "a", 2: "b"},
        "Start year": {0: 2021, 1: 2022, 2: 2020},
        "Duration": {0: 2.0, 1: 3.0, 2: 2.5},
        "Total amount": {0: "$30000", 1: "$3000", 2: "$15000"},
    }
)

You could try like this:

# Setup
df["Total amount"] = df["Total amount"].str.replace("$", "").astype("int")
df["Annual amount"] = df["Total amount"] / df["Duration"]
new_df = pd.DataFrame()

# Iterate on the datafram and append values to new one
for _, row in df.iterrows():
    for i in range(int(row["Duration"])):
        new_df = new_df.append(
            {
                "Program": row["Program"],
                "Year": row["Start year"]   i,
                "Annual amount": row["Annual amount"],
            },
            ignore_index=True,
        )
    if row["Duration"] % 2:  # For half years
        new_df = new_df.append(
            {
                "Program": row["Program"],
                "Year": row["Start year"]   i   1,
                "Annual amount": row["Annual amount"] * (row["Duration"] % 2),
            },
            ignore_index=True,
        )

# Aggregate rows
new_df = new_df.groupby(["Program", "Year"]).sum()

# Unstack and cleanup
new_df = new_df.reset_index()
new_df["Year"] = new_df["Year"].astype("int")

print(new_df)
# Outputs
  Program  Year  Annual amount
0       a  2021        15000.0
1       a  2022        16000.0
2       a  2023         1000.0
3       a  2024         1000.0
4       a  2025         1000.0
5       b  2020         6000.0
6       b  2021         6000.0
7       b  2022         3000.0

CodePudding user response：

Define a custom function to split an amount in decimal buckets, then apply it, explode and aggregate using groupby sum:

def split(duration, amount):
    d,r = divmod(duration, 1)
    a = amount/duration
    return [a]*int(d) [a*r]*int(r>0)

df_annual = \
(df.assign(**{'Annual amount': df.apply(lambda r: split(r['Duration'], float(r['Total amount'].strip('$'))), axis=1)})
   .explode('Annual amount')
   .assign(Year=lambda d: d.groupby(['Program', 'Start year']).cumcount() d['Start year'])
   .groupby(['Program', 'Year'], as_index=False)['Annual amount'].sum()
 )

Output:

  Program  Year  Annual amount
0       a  2021        15000.0
1       a  2022        16000.0
2       a  2023         1000.0
3       a  2024         1000.0
4       b  2020         6000.0
5       b  2021         6000.0
6       b  2022         3000.0

CodePudding user response：

repeat indices by df['Duration'], iteratively change year to consecutive years and finally, groupby ['Program','Start year']. The tricky part is to add the decimal duration time. Add that part using a mask:

df = pd.DataFrame([['a',2021,2,'$30000'],['a',2022,3,'$3000'],['b',2020,2.5,'$15000']], 
                  columns=['Program','Start year','Duration','Total amount'])
df['Annual amount'] = df['Total amount'].str[1:].astype(float) / df['Duration']
mask = df['Duration'].astype(str).str.split('.').str[1] != '0'
df_new = df.loc[df.index.repeat(df['Duration'])]
df.loc[mask,'Annual amount'] = df.loc[mask,'Annual amount'] * (df.loc[mask,'Duration'] % 1)
df_new = df_new.append(df[mask])
df.loc[mask,'Annual amount'] = df.loc[mask,'Total amount'].str[1:].astype(float) / (df.loc[mask,'Duration'] % 1)
for i in df_new.index.unique():
    year = int(df_new.loc[i,'Start year'].iloc[0])
    length = int(np.ceil(df_new.loc[i,'Duration'].iloc[0]))
    df_new.loc[i,'Start year'] = range(year,year length)
df_new = df_new.groupby(['Program','Start year'])['Annual amount'].sum().reset_index()