I have a pandas DataFrame df
looking like this :
item year value
A 2010 20
A 2011 25
A 2012 32
B 2016 20
B 2019 40
B 2018 50
My goal is to be able, for each item to calculate the difference of value between each date. Then for example, I want to find for item A : 12 (32 -20 because year max is 2012 and year min is 2010) and for item B : 20 (40 - 20, because year max is 2019 and year min is 2016).
I use the following code to get, for each item, year max and year min :
df.groupby("item").agg({'year':[np.min, np.max]})
Then, I find the year min and year max for each item. However, I stuck to make what I want.
CodePudding user response:
Try sort_values
by year
, then you can groupby
and select first
for min
and last
for max
:
g = df.sort_values('year').groupby('item')
out = g['value'].last() - g['value'].first()
Output:
item
A 12
B 20
Name: value, dtype: int64
CodePudding user response:
Use:
def fun(x):
return x[x.index.max()] - x[x.index.min()]
res = df.set_index("year").groupby("item").agg(fun)
print(res)
Output
value
item
A 12
B 20
CodePudding user response:
Use loc accessor in agg to calculate value difference and also, you can also concat the first and last year in an item to give you a clear indication of the range.
df.sort_values(by=['item','year']).groupby('item').agg( year=('year', lambda x: str(x.iloc[0]) '-' str(x.iloc[-1])),value=('value', lambda x: x.iloc[-1]-x.iloc[0]))
year value
item
A 2010-2012 12
B 2016-2019 20