I am trying to calculate a sum for each date
field, however I only want to calculate the sum of IDs that in both the current and next date
, so a rolling
comparison of IDs and then a groupby
sum. Currently I have to loop over the dataframe which is very slow.
For example my df:
df = pd.DataFrame({
'Date': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
'ID': [ 1, 2, 3, 4 , 2, 3, 4 , 2, 3, 4,5 , 1, 2, 3, 4],
'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
})
Ideally I want to group the dataframe by Date
and to only sum the IDs
that are common between two dates, for example below. However this is very slow.
tmpL = df.groupby('Date')['ID'].apply(list)
tmpV = df.groupby('Date')['Value'].sum()
for i in range(1, tmpL.shape[0]):
res = list(set(tmpL.iloc[i]) - set(tmpL.iloc[i - 1]))
v = df.loc[ df.ID.isin(res) & (df.Date == tmpL.index[i]), 'Value'].sum()
tmpV.iloc[i] = tmpV.iloc[i] - v
tmpV
Date
1 10
2 18
3 27
4 42
Name: Value, dtype: int64
Is there a way to do this in pandas
without looping over the dataframe?
CodePudding user response:
Use DataFrame.pivot_table
with aggregate sum
, compare for not equal with DataFrame.diff
, and last passed to DataFrame.mask
with sum
:
df1 = df.pivot_table(index='Date', columns='ID', values='Value', aggfunc='sum')
s = df1.mask(df1.notna().diff().fillna(False)).sum(axis=1)
print (s)
Date
1 10.0
2 18.0
3 27.0
4 42.0
dtype: float64
First solution, I think slowier:
You can get all not matched sets by convert original to set
s, then use Series.diff
, Series.explode
and get all matched values of original by DataFrame.merge
, last aggregate sum
and subtract:
tmpL = (df.groupby('Date')['ID'].apply(set)
.diff()
.explode()
.reset_index()
.merge(df)
.groupby('Date')['Value']
.sum())
tmpV = df.groupby('Date')['Value'].sum()
out = tmpV.sub(tmpL, fill_value=0)
print (out)
Date
1 10.0
2 18.0
3 27.0
4 42.0
CodePudding user response:
Try:
df = df.pivot_table(index='Date', columns='ID', values='Value')#.reset_index()
condition = df.notna() & df.notna().shift(1)
condition.iloc[0,:]=True
print(df[condition].sum(axis=1))
Output:
Date
1 10.0
2 18.0
3 27.0
4 42.0