I am trying to calculate value growth/decline using the minimum date and maximum date. My data currently looks like this:
Code Date Value 0 A 2020-12-31 80122.0 1 A 2019-12-31 45472.0 2 A 2018-12-31 31917.0 3 A 2017-12-31 23432.0 4 B 2020-12-31 0.0
For Code A I need to keep the max (2020-12-31) and min (2017-12-31) dates as well as the values so I can calculate the difference later on. I have multiple codes and need to be able to apply the same logic to each one. Any suggestions on the best way to approach this?
Thanks
CodePudding user response:
In your case, you want to sort the date, then groupby and extract first, last:
df.sort_values(['Code','Date']).groupby('Code')['Value'].agg(['first','last'])
Output:
first last
Code
A 23432.0 80122.0
B 0.0 0.0
CodePudding user response:
I would first sort_values
then you can drop_duplicates
on 'Code'. Using different logic for keep
this allows you to get the first and last row (based on Date) within each 'Code', which you can then subtract to get the day difference and Value difference for each code.
df = df.sort_values(['Code', 'Date'])
(df.drop_duplicates('Code', keep='last').set_index('Code')
- df.drop_duplicates('Code', keep='first').set_index('Code'))
# Date Value
#Code
#A 1096 days 56690.0
#B 0 days 0.0
Alternatively if you don't just need the difference and actually need the rows, then I would concat
those together instead of subtracting. The main reason to avoid the .first
aggregation is because it does not guarantee data come from the same rows (without specifying dropna
) in the case of null values.
pd.concat([df.drop_duplicates('Code', keep='last').set_index('Code'),
df.drop_duplicates('Code', keep='first').set_index('Code')],
keys=['Last', 'First'], axis=1)
# Last First
# Date Value Date Value
#Code
#A 2020-12-31 80122.0 2017-12-31 23432.0
#B 2020-12-31 0.0 2020-12-31 0.0
CodePudding user response:
since you
need to keep the max (2020-12-31) and min (2017-12-31) dates as well as the values...
, you can try:
df = pd.DataFrame({'Code':['A','A','A','A','B'],
'Date': ['2020-12-31', '2019-12-31', '2018-12-31', '2017-12-31', '2020-12-31'],
'Value': [80122.0, 45472.0, 31917.0, 23432.0, 0.0]
}, )
df.loc[:, 'Date'] = pd.to_datetime(df.loc[:, 'Date'])
is the df mentioned:
Code Date Value
0 A 2020-12-31 80122.0
1 A 2019-12-31 45472.0
2 A 2018-12-31 31917.0
3 A 2017-12-31 23432.0
4 B 2020-12-31 0.0
so another way can be:
dictionary = {}
for code in df.loc[:, 'Code'].unique():
dictionary[code] = {'Date min': df.loc[df.loc[:, 'Code']==code,'Date'].min(),
'Value min': df.loc[(df.loc[:, 'Code']==code)& (df.loc[:,'Date'] == df.loc[df.loc[:, 'Code']==code,'Date'].min()), 'Value'].values[0],
'Date max': df.loc[df.loc[:, 'Code']==code,'Date'].max(),
'Value max':df.loc[(df.loc[:, 'Code']==code)&(df.loc[:,'Date'] == df.loc[df.loc[:, 'Code']==code,'Date'].max()), 'Value'].values[0]
}
resume = pd.DataFrame(dictionary)
resume = resume.transpose()
resume
that outputs:
Date min Value min Date max Value max
A 2017-12-31 23432.0 2020-12-31 80122.0
B 2020-12-31 0.0 2020-12-31 0.0