Need to insert rows for missing dates for individuals in pandas dataframe


I have a data set containing donor information for several years, and I need to insert rows where a donor has skipped a year. There are several thousand records in the actual dataframe, but a sample looks like this

import pandas as pd
df = pd.DataFrame([['A','2011',10], ['A','2012',10],['A','2013',10],['B','2011',20], 
                   ['B','2013',20]],columns=['donor_id','year','donation'])
df

donor_id    year    donation
0   A   2011    10
1   A   2012    10
2   A   2013    10
3   B   2011    20
4   B   2013    20

I need to insert a zero donation for donor B for 2012, so it should end up looking like this

donor_id    year    donation
0   A   2011    10
1   A   2012    10
2   A   2013    10
3   B   2011    20
4   B   2012    0
5   B   2013    20

I've tried several solutions to similar problems, but haven't been successful yet. This solution looks exactly like what I need, but I lose about half the rows of the dataframe and can't figure out why that's happening.

df = pd.read_csv(r'filepath')
df = df.drop_duplicates(subset=['donor_id','year'])
df['year_DT'] = pd.to_datetime(df['year'])

df = (df.set_index('year_DT').
      groupby('donor_id').
      apply(lambda x: x.asfreq(freq='Y')).
      drop('donor_id', axis=1))

df = df.reset_index()
df["Index"] = df.groupby('donor_id').cumcount() + 1

CodePudding user response:

You can .groupby() on the donor_id column and apply a custom function to each group.

In this function you merge the actual group with a new pd.Series built from range(<min year of this group>, <max year of this group> + 1).

Afterwards, the rows missing after this merge (NaNs) are filled with actual values:

def fn(x):
    out = x.merge(
        pd.Series(range(x["year"].min(), x["year"].max() + 1), name="year"),
        how="right",
    )
    out["donor_id"] = out["donor_id"].ffill()
    out["donation"] = out["donation"].fillna(0)
    return out


df["year"] = df["year"].astype(int)
df = df.groupby("donor_id").apply(fn).reset_index(drop=True)
print(df)

Prints:

  donor_id  year  donation
0        A  2011      10.0
1        A  2012      10.0
2        A  2013      10.0
3        B  2011      20.0
4        B  2012       0.0
5        B  2013      20.0
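Note that filling the NaNs promotes donation to float, which is why the output above prints 10.0 rather than 10. If you want the original integer dtype back, one way (a sketch using the sample dataframe from the question; the final astype(int) cast is my addition) is to cast after the groupby:

```python
import pandas as pd

df = pd.DataFrame([['A', '2011', 10], ['A', '2012', 10], ['A', '2013', 10],
                   ['B', '2011', 20], ['B', '2013', 20]],
                  columns=['donor_id', 'year', 'donation'])

def fn(x):
    # right-merge the group against the full year range so missing years appear as NaN rows
    out = x.merge(
        pd.Series(range(x["year"].min(), x["year"].max() + 1), name="year"),
        how="right",
    )
    out["donor_id"] = out["donor_id"].ffill()
    out["donation"] = out["donation"].fillna(0)
    return out

df["year"] = df["year"].astype(int)
df = df.groupby("donor_id").apply(fn).reset_index(drop=True)
df["donation"] = df["donation"].astype(int)  # restore integer dtype lost to the NaN fill
print(df)
```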

CodePudding user response:

I would try generating a new index for your dataframe and then conforming the dataframe to it with reindex:

df.year = df.year.astype(int)

years = list(range(df['year'].astype(int).min(), df['year'].astype(int).max() + 1))
ids = list(df.donor_id.unique())

new_index = pd.MultiIndex.from_product([ids, years], names=['donor_id', 'year'])

df_new = df.set_index(['donor_id', 'year'])
df_new = df_new.reindex(new_index, fill_value=0)
df_new = df_new.reset_index()

# Output:
    donor_id    year    donation
0   A   2011    10
1   A   2012    10
2   A   2013    10
3   B   2011    20
4   B   2012    0
5   B   2013    20

CodePudding user response:

unique =  df.year.unique()
data = df.groupby("donor_id").agg({"year":lambda x: set(unique) - set(x)}).explode("year").dropna().reset_index()
data["donation"] = 0
pd.concat([df, data]).sort_values(["donor_id", "year"])

output:

donor_id    year    donation
0   A   2011    10
1   A   2012    10
2   A   2013    10
3   B   2011    20
0   B   2012    0
4   B   2013    20
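As written, the concatenated result keeps the row labels of the two source frames (note the repeated index 0 above). A runnable sketch of the same approach on the question's sample data, with a reset_index(drop=True) added at the end (my addition) to get a clean 0..n index:

```python
import pandas as pd

df = pd.DataFrame([['A', '2011', 10], ['A', '2012', 10], ['A', '2013', 10],
                   ['B', '2011', 20], ['B', '2013', 20]],
                  columns=['donor_id', 'year', 'donation'])

# for each donor, compute the set of years they are missing, one row per missing year
unique = df.year.unique()
data = (df.groupby("donor_id")
          .agg({"year": lambda x: set(unique) - set(x)})
          .explode("year")
          .dropna()
          .reset_index())
data["donation"] = 0

# append the zero-donation rows and renumber the index
out = (pd.concat([df, data])
         .sort_values(["donor_id", "year"])
         .reset_index(drop=True))
print(out)
```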