How to create a new metric column based on 1 year lag of a date column?-CodePudding

I would like create a new column which references a date column - 1 year and displays the corresponding values:

import pandas as pd

Input DF
df = pd.DataFrame({'consumption': [0,1,3,5], 'date':[pd.to_datetime('2017-04-01'),
                                             pd.to_datetime('2017-04-02'),
                                             pd.to_datetime('2018-04-01'),
                                             pd.to_datetime('2018-04-02')]})    
>>> df
   consumption       date
    0            2017-04-01
    1            2017-04-02
    3            2018-04-01
    5            2018-04-02


Expected DF

df = pd.DataFrame({'consumption': [0,1,3,5], 
              'prev_year_consumption': [np.NAN,np.NAN,0,1],
              'date':[pd.to_datetime('2017-04-01'),
                        pd.to_datetime('2017-04-02'),
                        pd.to_datetime('2018-04-01'),
                        pd.to_datetime('2018-04-02')]})  

  >>> df
      consumption  prev_year_consumption       date
              0                      NAN     2017-04-01
              1                      NAN     2017-04-02
              3                      0       2018-04-01
              5                      1       2018-04-02

So prev_year_consumption is are simply values from the consumption column where 1 year is subtracted from date dynamically.

in SQL I would probably do something like

SELECT df_past.consumption as prev_year_consumption, df_current.consumption
FROM df as df_current 
LEFT JOIN ON df df_past ON year(df_current.date) = year(df_past.date) - 1

Appreciate any hints

CodePudding user response：

The notation in pandas is similar. We are still doing a self merge however we need to specify that the right_on (or left_on) has a DateOffset of 1 year:

new_df = df.merge(
    df,
    left_on='date',
    right_on=df['date']   pd.offsets.DateOffset(years=1),
    how='left'
)

new_df:

        date  consumption_x     date_x  consumption_y     date_y
0 2017-04-01              0 2017-04-01            NaN        NaT
1 2017-04-02              1 2017-04-02            NaN        NaT
2 2018-04-01              3 2018-04-01            0.0 2017-04-01
3 2018-04-02              5 2018-04-02            1.0 2017-04-02

We can further drop and rename columns to get exact output:

new_df = df.merge(
    df,
    left_on='date',
    right_on=df['date']   pd.offsets.DateOffset(years=1),
    how='left'
).drop(columns=['date_x', 'date_y']).rename(columns={
    'consumption_y': 'prev_year_consumption'
})

new_df:

        date  consumption_x  prev_year_consumption
0 2017-04-01              0                    NaN
1 2017-04-02              1                    NaN
2 2018-04-01              3                    0.0
3 2018-04-02              5                    1.0