Home > Software design >  How to transform panda dataframe based on date and name
How to transform panda dataframe based on date and name

Time:07-07

I'm trying to transform a dataframe without but haven't achieved my desired output, would appreciate some help:

Input data:

date         name   value
2022-07-01   Anna    5
2022-07-01   Jim     3
2022-04-29   Anna    4
2022-04-29   Jim     2
2022-04-21   Anna    1
2021-07-29   Jim     5
2021-07-29   Anna    5

Desired output:

Name         Last_30_days   Last_365_days
Anna            5              15
Jim             3              10

My closest attempt :

df_new = (df.set_index('date')
        .groupby('name', sort=False)
        .resample('MS')['value'].sum()
        .groupby(level=0)
        .cumsum()
        .reset_index(name='value'))

CodePudding user response:

Try this:

daydiff = (pd.Timestamp('now') - df['date']).dt.days
bins = [0, 30, 365, np.inf]
labels = [f'Last_{i}_days' for i in bins[1:-1]]   ['Greater than 365 days']
df['period'] = pd.cut(daydiff, bins=bins, labels=labels)
df_out = df.pivot_table('value', 'name', 'period', aggfunc='sum')
print(df_out)

Output:

period  Last_30_days  Last_365_days  Greater than 365 days
name                                                     
Anna               5             10                      0
Jim                3              7                      0

And, you can add a cumsum to get cumulative totals along the rows:

daydiff = (pd.Timestamp('now') - df['date']).dt.days
bins = [0,30,365,np.inf]
labels = [f'Last_{i}_days' for i in bins[1:-1]]   ['Greater than 365 days']
df['period'] = pd.cut(daydiff, bins=bins, labels=labels)
df_out = df.pivot_table('value', 'name', 'period', aggfunc='sum')
df_out = df_out.cumsum(axis=1)
print(df_out)

Output:

period  Last_30_days  Last_365_days  Greater than 365 days
name                                                     
Anna               5             15                     15
Jim                3             10                     10

CodePudding user response:

Calculate new binary column to check whether the day is included, then pivot, then concat all selected day differences.

days_select = [30, 60, 365]
pd.concat([df.assign(in_last_n_days=(pd.Timestamp.today() - df.date) < pd.Timedelta(f"{ndays}d"))
             .query("in_last_n_days")
             .pivot_table(index="name", values="value", aggfunc=sum)
             .rename(columns={"value": f"last_{ndays}_days"})
           for ndays in days_select], axis=1)

      last_30_days  last_60_days  last_365_days
name                                           
Anna             5             5             15
Jim              3             3             10

CodePudding user response:

df_pivot = pd.pivot_table(df, values='value',index='name',columns='date')
df_pivot.fillna(0,inplace=True)
expected_df  = pd.DataFrame(df_pivot['2021-07-29']   df_pivot['2022-04-21']   df_pivot['2022-04-29']   df_pivot['2022-07-01'])
expected_df['last_365_days'] = df_pivot['2021-07-29']   df_pivot['2022-04-21']   df_pivot['2022-04-29']   df_pivot['2022-07-01']
del expected_df[0]
expected_df['last_30_days'] = df_pivot['2022-07-01']
  • Related