I have a DataFrame with 3 columns: POD (which is a code), timestamp, EAI_ALL (number). I want to calculate a 4th column where every row has the following value: the value of EAI_ALL of the current row minus the value of EAI_ALL on the previous row. This has to be done for every code (e.g. if the current code is 2 and the code in the previous row is 1, the calculated value has to be 0 because the code is different). I managed to do this operation with the following code:
#group the DF by POD code
grouped = df.groupby('POD')
#Define Lambda function
eai_diff = lambda x: x.EAI_ALL - x.EAI_ALL.shift( 1)
df['EAI_Delta'] = grouped.apply(eai_diff).reset_index(0, drop=True).fillna(0)
This works fine, unless I have only one POD code in the DataFrame. I keep getting this error if I apply the function to a DataFrame with only one POD code.
ValueError: Wrong number of items passed 48, placement implies 1
As you can see df has 48 rows, but with the group by it gets reduced to one row. I need the groupby because if I have more than one POD code the operation basically has to be done for one POD code at a time.
Does anyone have any suggestion? Thanks in advance!
CodePudding user response:
This is an error that comes back regularly, which comes from the fact that returning a series from a dataframe apply
with a single group returns a dataframe with a single row, and the index as columns:
>>> df
POD EAI_ALL
0 foo 0
1 foo 1
2 foo 2
3 foo 3
4 foo 4
5 foo 5
6 foo 6
7 foo 7
8 foo 8
>>> df.groupby('POD').apply(lambda x: x.EAI_ALL - x.EAI_ALL.shift( 1))
EAI_ALL 0 1 2 3 4 5 6 7 8
POD
foo NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
To avoid that, and if you really need .apply
, you can return a dataframe instead:
>>> df.groupby('POD').apply(lambda x: x[['EAI_ALL']] - x[['EAI_ALL']].shift( 1))
EAI_ALL
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 1.0
7 1.0
8 1.0
However the better way is to try and stay away from .apply
. For your use case, groupby.shift()
works perfectly and only shifts within the group as you expect:
>>> df.groupby('POD')['EAI_ALL'].shift()
0 NaN
1 0
2 1
3 2
4 3
5 4
6 5
>>> df['EAI_Delta'] = (df['EAI_ALL'] - df.groupby('POD')['EAI_ALL'].shift()).fillna(0)
>>> df
POD EAI_ALL EAI_Delta
0 foo 0 0.0
1 foo 1 1.0
2 foo 2 1.0
3 foo 3 1.0
4 foo 4 1.0
5 foo 5 1.0
6 foo 6 1.0
7 foo 7 1.0
8 foo 8 1.0
And an example with several POD
values:
>>> df2['EAI_Delta'] = (df2['EAI_ALL'] - df2.groupby('POD')['EAI_ALL'].shift()).fillna(0)
>>> df2
POD EAI_ALL EAI_Delta
0 foo 0 0.0
1 foo 1 1.0
2 foo 2 1.0
3 foo 3 1.0
4 bar 4 0.0
5 bar 5 1.0
6 bar 6 1.0
7 bar 7 1.0
8 bar 8 1.0
Even better as suggested by @mandiatodos in comments, you can directly use .groupby().diff()
CodePudding user response:
If you use pandas version less than 1.1.0
you can set squeeze
parameter of groupby
to True
:
import pandas as pd
def group_with_squeeze(df):
grouped = df.groupby('POD', squeeze=True)
eai_diff = lambda x: x.EAI_ALL - x.EAI_ALL.shift( 1)
df['EAI_Delta'] = grouped.apply(eai_diff).reset_index(0, drop=True).fillna(0)
df = pd.DataFrame({'POD': list('ABBBCC'), 'EAI_ALL': [1, 2, 3, 1, 4, 4]})
group_with_squeeze(df)
df
# POD EAI_ALL EAI_Delta
# 0 A 1 0.0
# 1 B 2 0.0
# 2 B 3 1.0
# 3 B 1 -2.0
# 4 C 4 0.0
# 5 C 4 0.0
df = pd.DataFrame({'POD': list('AAAAAA'), 'EAI_ALL': [1, 2, 3, 1, 4, 4]})
group_with_squeeze(df)
df
# POD EAI_ALL EAI_Delta
# 0 A 1 0.0
# 1 A 2 1.0
# 2 A 3 1.0
# 3 A 1 -2.0
# 4 A 4 3.0
# 5 A 4 0.0