I have this original dataset in Pandas dataframe:
Program | Start year | Duration | Total amount |
---|---|---|---|
a | 2021 | 2 | $30000 |
a | 2022 | 3 | $3000 |
b | 2020 | 2.5 | $15000 |
Want to rearrange into the following dataframe:
Program | Year | Annual amount |
---|---|---|
a | 2021 | $15,000 |
a | 2022 | $16,000 |
a | 2023 | $1,000 |
a | 2024 | $1,000 |
b | 2020 | $6,000 |
b | 2021 | $6,000 |
b | 2022 | $3,000 |
CodePudding user response:
So, given the following dataframe:
import pandas as pd
df = pd.DataFrame(
{
"Program": {0: "a", 1: "a", 2: "b"},
"Start year": {0: 2021, 1: 2022, 2: 2020},
"Duration": {0: 2.0, 1: 3.0, 2: 2.5},
"Total amount": {0: "$30000", 1: "$3000", 2: "$15000"},
}
)
You could try like this:
# Setup
df["Total amount"] = df["Total amount"].str.replace("$", "").astype("int")
df["Annual amount"] = df["Total amount"] / df["Duration"]
new_df = pd.DataFrame()
# Iterate on the datafram and append values to new one
for _, row in df.iterrows():
for i in range(int(row["Duration"])):
new_df = new_df.append(
{
"Program": row["Program"],
"Year": row["Start year"] i,
"Annual amount": row["Annual amount"],
},
ignore_index=True,
)
if row["Duration"] % 2: # For half years
new_df = new_df.append(
{
"Program": row["Program"],
"Year": row["Start year"] i 1,
"Annual amount": row["Annual amount"] * (row["Duration"] % 2),
},
ignore_index=True,
)
# Aggregate rows
new_df = new_df.groupby(["Program", "Year"]).sum()
# Unstack and cleanup
new_df = new_df.reset_index()
new_df["Year"] = new_df["Year"].astype("int")
print(new_df)
# Outputs
Program Year Annual amount
0 a 2021 15000.0
1 a 2022 16000.0
2 a 2023 1000.0
3 a 2024 1000.0
4 a 2025 1000.0
5 b 2020 6000.0
6 b 2021 6000.0
7 b 2022 3000.0
CodePudding user response:
Define a custom function to split an amount in decimal buckets, then apply
it, explode
and aggregate using groupby
sum
:
def split(duration, amount):
d,r = divmod(duration, 1)
a = amount/duration
return [a]*int(d) [a*r]*int(r>0)
df_annual = \
(df.assign(**{'Annual amount': df.apply(lambda r: split(r['Duration'], float(r['Total amount'].strip('$'))), axis=1)})
.explode('Annual amount')
.assign(Year=lambda d: d.groupby(['Program', 'Start year']).cumcount() d['Start year'])
.groupby(['Program', 'Year'], as_index=False)['Annual amount'].sum()
)
Output:
Program Year Annual amount
0 a 2021 15000.0
1 a 2022 16000.0
2 a 2023 1000.0
3 a 2024 1000.0
4 b 2020 6000.0
5 b 2021 6000.0
6 b 2022 3000.0
CodePudding user response:
repeat
indices by df['Duration']
, iteratively change year to consecutive years and finally, groupby
['Program','Start year']
. The tricky part is to add the decimal duration time. Add that part using a mask
:
df = pd.DataFrame([['a',2021,2,'$30000'],['a',2022,3,'$3000'],['b',2020,2.5,'$15000']],
columns=['Program','Start year','Duration','Total amount'])
df['Annual amount'] = df['Total amount'].str[1:].astype(float) / df['Duration']
mask = df['Duration'].astype(str).str.split('.').str[1] != '0'
df_new = df.loc[df.index.repeat(df['Duration'])]
df.loc[mask,'Annual amount'] = df.loc[mask,'Annual amount'] * (df.loc[mask,'Duration'] % 1)
df_new = df_new.append(df[mask])
df.loc[mask,'Annual amount'] = df.loc[mask,'Total amount'].str[1:].astype(float) / (df.loc[mask,'Duration'] % 1)
for i in df_new.index.unique():
year = int(df_new.loc[i,'Start year'].iloc[0])
length = int(np.ceil(df_new.loc[i,'Duration'].iloc[0]))
df_new.loc[i,'Start year'] = range(year,year length)
df_new = df_new.groupby(['Program','Start year'])['Annual amount'].sum().reset_index()