I have monthly (end-of-month) data for many companies. I want to create a new column, new_col, with groupby for each company, where new_col from Jul of this year to Jun of next year takes the old_col value of Dec of last year. For example, new_col from Jul-2000 to Jun-2001 will equal the old_col value of Dec-1999.
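The rule above can be restated as date arithmetic: a row in Jul-Dec of year Y reads Dec of Y-1, and a row in Jan-Jun of year Y reads Dec of Y-2. A minimal sketch of that lookup, where `source_december` is a hypothetical helper name used only for illustration and month-end dates are assumed:

```python
import pandas as pd


def source_december(ts: pd.Timestamp) -> pd.Timestamp:
    """Return the Dec-31 date whose old_col value feeds new_col at `ts`."""
    # Jul-Dec of year Y reads Dec of Y-1; Jan-Jun of year Y reads Dec of Y-2.
    year = ts.year - 1 if ts.month >= 7 else ts.year - 2
    return pd.Timestamp(year=year, month=12, day=31)


print(source_december(pd.Timestamp("2000-07-31")))  # 1999-12-31 00:00:00
print(source_december(pd.Timestamp("2001-06-30")))  # 1999-12-31 00:00:00
```

Both ends of the Jul-2000 to Jun-2001 window map to Dec-1999, which matches the example in the question.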
You can download sample data here: https://www.dropbox.com/s/oz1ltblh6u0chzt/tem_20220506.csv?dl=0
I have tried the following code, without success:
df = pd.read_csv('tem_20220506.csv', parse_dates=['date'])
df.drop(columns=['new_col', 'Note'], inplace=True)
df = df.set_index('date').rename_axis(None)
df['new_col'] = df.groupby('comp').apply(lambda g: --- ) # ← I am now stuck here
Desired output:
comp old_col new_col
2000-01-31 a 1 NaN
2000-02-29 a 2 NaN
2000-03-31 a 3 NaN
2000-04-30 a 4 NaN
2000-05-31 a 5 NaN
2000-06-30 a 6 NaN
2000-07-31 a 7 NaN
2000-08-31 a 8 NaN
2000-09-30 a 9 NaN
2000-10-31 a 10 NaN
2000-11-30 a 11 NaN
2000-12-31 a 12 NaN
2001-01-31 a 13 NaN
2001-02-28 a 14 NaN
2001-03-31 a 15 NaN
2001-04-30 a 16 NaN
2001-05-31 a 17 NaN
2001-06-30 a 18 NaN
2001-07-31 a 19 12.000
2001-08-31 a 20 12.000
2001-09-30 a 21 12.000
2001-10-31 a 22 12.000
2001-11-30 a 23 12.000
2001-12-31 a 24 12.000
2002-01-31 a 25 12.000
2002-02-28 a 26 12.000
2002-03-31 a 27 12.000
2002-04-30 a 28 12.000
2002-05-31 a 29 12.000
2002-06-30 a 30 12.000
2002-07-31 a 31 24.000
2002-08-31 a 32 24.000
2002-09-30 a 33 24.000
2002-10-31 a 34 24.000
2002-11-30 a 35 24.000
2002-12-31 a 36 24.000
2000-01-31 b 101 NaN
2000-02-29 b 102 NaN
2000-03-31 b 103 NaN
2000-04-30 b 104 NaN
2000-05-31 b 105 NaN
2000-06-30 b 106 NaN
2000-07-31 b 107 NaN
2000-08-31 b 108 NaN
2000-09-30 b 109 NaN
2000-10-31 b 110 NaN
2000-11-30 b 111 NaN
2001-01-31 b 113 NaN
2001-02-28 b 114 NaN
2001-03-31 b 115 NaN
2001-04-30 b 116 NaN
2001-05-31 b 117 NaN
2001-06-30 b 118 NaN
2001-07-31 b 119 NaN
2001-08-31 b 120 NaN
2001-09-30 b 121 NaN
2001-10-31 b 122 NaN
2001-11-30 b 123 NaN
2001-12-31 b 124 NaN
2002-01-31 b 125 NaN
2002-02-28 b 126 NaN
2002-03-31 b 127 NaN
2002-04-30 b 128 NaN
2002-05-31 b 129 NaN
2002-06-30 b 130 NaN
2002-07-31 b 131 124.000
2002-08-31 b 132 124.000
2002-10-31 b 134 124.000
2002-11-30 b 135 124.000
2002-12-31 b 136 124.000
(!!) Note that, for comp == b:
new_col is NaN from Jul-2001 to Jun-2002 because the Dec-2000 value is missing.
The Sep-2002 row is missing, but that is OK.
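If the CSV link is unavailable, a frame with the same shape can be approximated from the desired output above. This is only a sketch: the values and the two missing rows for company b are read off the table, and the construction itself is an assumption rather than the original file.

```python
import pandas as pd

# Month-end dates from Jan-2000 to Dec-2002 (36 per company).
dates = (
    pd.period_range("2000-01", "2002-12", freq="M")
    .to_timestamp(how="end")
    .normalize()
)

a = pd.DataFrame({"date": dates, "comp": "a", "old_col": range(1, 37)})
b = pd.DataFrame({"date": dates, "comp": "b", "old_col": range(101, 137)})

# Company b has no row for Dec-2000 or Sep-2002 in the question.
missing = pd.to_datetime(["2000-12-31", "2002-09-30"])
b = b[~b["date"].isin(missing)]

df = pd.concat([a, b], ignore_index=True).set_index("date").rename_axis(None)
print(df.shape)  # (70, 2)
```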
CodePudding user response:
import pandas as pd

df = pd.read_csv('tem_20220506.csv', parse_dates=['date'])
df.drop(columns=['new_col', 'Note'], inplace=True)
df.set_index('date', inplace=True)
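Use a helper function to build the new column from the old one:
def helper_func(x):
    # work on a copy of the group and start with an all-NaN new_col
    x = x.copy()
    x['new_col'] = float('nan')
    # get the old_col values for the rows where month == 12, keyed by date
    req_values = x[x.index.month == 12].to_dict()['old_col']
    # for each December date, fill July of the next year through June of the year after
    for date_value, old_col_value in req_values.items():
        x.loc[f'{date_value.year + 1}-07-31':f'{date_value.year + 2}-06-30', 'new_col'] = old_col_value
    return x
# group_keys=False keeps the original date index on the result;
# the assignment assumes df is sorted by comp, then date, as in the sample
df['new_col'] = df.groupby('comp', group_keys=False)[['old_col']].apply(helper_func)['new_col']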
This should produce the DataFrame shown in your desired output.
An alternative helper function:
def helper_fun2(x):
    """
    1. Iterate over the years present in the group.
    2. If Dec-31 of that year exists, copy its old_col value into new_col
       for July of the next year through June of the year after.
    """
    x = x.copy()
    x['new_col'] = float('nan')
    for year in x.index.year.unique():
        dec = pd.Timestamp(year=year, month=12, day=31)
        if dec in x.index:
            x.loc[f'{year + 1}-07-31':f'{year + 2}-06-30', 'new_col'] = x.loc[dec, 'old_col']
    return x
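A loop-free variant is also possible. The sketch below is not part of the answer above: it computes, for every row, the year of the December it should read from, then left-merges the December values back on (comp, year). The column name `src_year` and the synthetic data are assumptions made for illustration, and `date` is kept as a regular column rather than the index.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV: month-end rows, company b lacks Dec-2000.
dates = (
    pd.period_range("2000-01", "2002-12", freq="M")
    .to_timestamp(how="end")
    .normalize()
)
df = pd.concat(
    [
        pd.DataFrame({"date": dates, "comp": "a", "old_col": np.arange(1, 37)}),
        pd.DataFrame({"date": dates, "comp": "b", "old_col": np.arange(101, 137)}),
    ],
    ignore_index=True,
)
drop = (df["comp"] == "b") & (df["date"] == pd.Timestamp("2000-12-31"))
df = df[~drop].reset_index(drop=True)

# Year of the December each row reads from:
# Jul-Dec of year Y -> Y-1, Jan-Jun of year Y -> Y-2.
df["src_year"] = df["date"].dt.year - np.where(df["date"].dt.month >= 7, 1, 2)

# One December value per (comp, year), renamed to become new_col.
dec = (
    df.loc[df["date"].dt.month == 12, ["comp", "date", "old_col"]]
    .assign(src_year=lambda d: d["date"].dt.year)
    .drop(columns="date")
    .rename(columns={"old_col": "new_col"})
)

# A left merge keeps every row; pairs with no December value become NaN.
out = df.merge(dec, on=["comp", "src_year"], how="left").drop(columns="src_year")
```

Because the lookup is a merge rather than a date slice, it does not depend on the rows being sorted, and a missing December simply leaves NaN in the following Jul-Jun window.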