pandas groupby only aggregating rows that are common between two consecutive fields that are grouped-CodePudding

I am trying to calculate a sum for each date field, however I only want to calculate the sum of IDs that in both the current and next date, so a rolling comparison of IDs and then a groupby sum. Currently I have to loop over the dataframe which is very slow.

For example my df:

df = pd.DataFrame({
   'Date': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
   'ID': [ 1, 2, 3, 4 , 2, 3, 4 , 2, 3, 4,5 , 1, 2, 3, 4],
   'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
})

Ideally I want to group the dataframe by Date and to only sum the IDs that are common between two dates, for example below. However this is very slow.

tmpL = df.groupby('Date')['ID'].apply(list)
tmpV = df.groupby('Date')['Value'].sum()
for i in range(1, tmpL.shape[0]):
    res = list(set(tmpL.iloc[i]) - set(tmpL.iloc[i - 1]))
    v = df.loc[ df.ID.isin(res) & (df.Date == tmpL.index[i]), 'Value'].sum()
    tmpV.iloc[i] = tmpV.iloc[i] - v
tmpV

Date
1    10
2    18
3    27
4    42
Name: Value, dtype: int64

Is there a way to do this in pandas without looping over the dataframe?

CodePudding user response：

Use DataFrame.pivot_table with aggregate sum, compare for not equal with DataFrame.diff, and last passed to DataFrame.mask with sum:

df1 = df.pivot_table(index='Date', columns='ID', values='Value', aggfunc='sum')
s = df1.mask(df1.notna().diff().fillna(False)).sum(axis=1)
print (s)
Date
1    10.0
2    18.0
3    27.0
4    42.0
dtype: float64

First solution, I think slowier:

You can get all not matched sets by convert original to sets, then use Series.diff, Series.explode and get all matched values of original by DataFrame.merge, last aggregate sum and subtract:

tmpL = (df.groupby('Date')['ID'].apply(set)
          .diff()
          .explode()
          .reset_index()
          .merge(df)
          .groupby('Date')['Value']
          .sum())
tmpV = df.groupby('Date')['Value'].sum()

out = tmpV.sub(tmpL, fill_value=0)
print (out)
Date
1    10.0
2    18.0
3    27.0
4    42.0

CodePudding user response：

Try:

df = df.pivot_table(index='Date', columns='ID', values='Value')#.reset_index()
condition = df.notna() & df.notna().shift(1)
condition.iloc[0,:]=True
print(df[condition].sum(axis=1))

Output: