I have a data set containing donor information for several years, and I need to insert rows where a donor has skipped a year. There are several thousand records in the actual dataframe, but a sample looks like this:
import pandas as pd
df = pd.DataFrame([['A', '2011', 10], ['A', '2012', 10], ['A', '2013', 10],
                   ['B', '2011', 20], ['B', '2013', 20]],
                  columns=['donor_id', 'year', 'donation'])
df
  donor_id  year  donation
0        A  2011        10
1        A  2012        10
2        A  2013        10
3        B  2011        20
4        B  2013        20
I need to insert a zero donation for donor B in 2012, so it should end up looking like this:
  donor_id  year  donation
0        A  2011        10
1        A  2012        10
2        A  2013        10
3        B  2011        20
4        B  2012         0
5        B  2013        20
I've tried several solutions to similar problems but haven't been successful yet. This solution looks like exactly what I need, but when I run it I lose about half the rows of the dataframe and can't figure out why that's happening:
df = pd.read_csv(r'filepath')
df = df.drop_duplicates(subset=['donor_id', 'year'])
df['year_DT'] = pd.to_datetime(df['year'])
df = (df.set_index('year_DT')
        .groupby('donor_id')
        .apply(lambda x: x.asfreq(freq='Y'))
        .drop('donor_id', axis=1))
df = df.reset_index()
df["Index"] = df.groupby('donor_id').cumcount() + 1
CodePudding user response:
You can .groupby() on the donor_id column and apply a custom function to each group. In this function you merge the actual group with a new pd.Series made from range(<min year of this group>, <max year of this group> + 1). Afterwards, the missing rows from this merge (NaNs) are filled in:
def fn(x):
    # Right-merge with the complete year range so missing years show up as NaN rows.
    out = x.merge(
        pd.Series(range(x["year"].min(), x["year"].max() + 1), name="year"),
        how="right",
    )
    # Fill the donor id down and zero out the missing donations.
    out["donor_id"] = out["donor_id"].ffill()
    out["donation"] = out["donation"].fillna(0)
    return out
df["year"] = df["year"].astype(int)
df = df.groupby("donor_id").apply(fn).reset_index(drop=True)
print(df)
Prints:
  donor_id  year  donation
0        A  2011      10.0
1        A  2012      10.0
2        A  2013      10.0
3        B  2011      20.0
4        B  2012       0.0
5        B  2013      20.0
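Note that fillna() promotes donation to float. If you want integers back, one extra cast (my addition, not part of the original answer) restores the dtype:
# Safe here because fn has already replaced every NaN with 0.
df["donation"] = df["donation"].astype(int)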
CodePudding user response:
I would try generating a new index for your dataframe and then conforming the dataframe to it with reindex:
df.year = df.year.astype(int)
years = list(range(df['year'].min(), df['year'].max() + 1))
ids = list(df.donor_id.unique())
new_index = pd.MultiIndex.from_product([ids, years], names=['donor_id', 'year'])
df_new = df.set_index(['donor_id', 'year'])
df_new = df_new.reindex(new_index, fill_value=0)
df_new = df_new.reset_index()
# Output:
  donor_id  year  donation
0        A  2011        10
1        A  2012        10
2        A  2013        10
3        B  2011        20
4        B  2012         0
5        B  2013        20
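One caveat: MultiIndex.from_product builds the full donor/year grid, so every donor gets padded out to the global minimum and maximum year rather than their own span. If each donor should only be filled between their own first and last year, a per-group reindex would do it. A minimal sketch of mine, not part of the original answer (fill_years and per_donor are hypothetical names):
def fill_years(g):
    # Reindex this donor's donations over their own year span only.
    return g.reindex(range(g.index.min(), g.index.max() + 1), fill_value=0)

per_donor = (df.set_index('year')
               .groupby('donor_id')['donation']
               .apply(fill_years)
               .reset_index())
For the sample data both donors already span 2011-2013, so both variants give the same output.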
CodePudding user response:
# For each donor, collect the years they are missing relative to every year in the data.
unique = df.year.unique()
data = (df.groupby("donor_id")
          .agg({"year": lambda x: set(unique) - set(x)})
          .explode("year")
          .dropna()
          .reset_index())
data["donation"] = 0
pd.concat([df, data]).sort_values(["donor_id", "year"])
output:
  donor_id  year  donation
0        A  2011        10
1        A  2012        10
2        A  2013        10
3        B  2011        20
0        B  2012         0
4        B  2013        20
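The inserted rows keep their index from data (hence the second 0 above). If a clean sequential index matters, a drop-reset after sorting (my addition, not from the answer) tidies it up:
result = (pd.concat([df, data])
            .sort_values(["donor_id", "year"])
            .reset_index(drop=True))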