pandas groupby: only aggregating rows whose IDs are common between two consecutive grouped dates


I am trying to calculate a sum for each date, but I only want to include the IDs that appear on both the current date and the previous date, i.e. a rolling comparison of IDs followed by a groupby sum. Currently I have to loop over the dataframe, which is very slow.

For example my df:

import pandas as pd

df = pd.DataFrame({
    'Date': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'ID': [1, 2, 3, 4, 2, 3, 4, 2, 3, 4, 5, 1, 2, 3, 4],
    'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
})

Ideally I want to group the dataframe by Date and sum only the IDs that are common between consecutive dates, as below. For example, the sum for Date 3 is 8 + 9 + 10 = 27 because ID 5 does not appear on Date 2, so its value of 11 is excluded. The loop below produces the right result, but it is very slow.

tmpL = df.groupby('Date')['ID'].apply(list)  # IDs present on each date
tmpV = df.groupby('Date')['Value'].sum()     # plain per-date sums
for i in range(1, tmpL.shape[0]):
    # IDs that are new on this date, i.e. absent from the previous date
    res = list(set(tmpL.iloc[i]) - set(tmpL.iloc[i - 1]))
    # subtract their values from this date's sum
    v = df.loc[df.ID.isin(res) & (df.Date == tmpL.index[i]), 'Value'].sum()
    tmpV.iloc[i] = tmpV.iloc[i] - v
tmpV

Date
1    10
2    18
3    27
4    42
Name: Value, dtype: int64

Is there a way to do this in pandas without looping over the dataframe?

CodePudding user response:

Use DataFrame.pivot_table with aggfunc='sum', flag the cells whose presence changed from the previous date by applying DataFrame.diff to the DataFrame.notna mask, hide those cells with DataFrame.mask, and sum each row:

df1 = df.pivot_table(index='Date', columns='ID', values='Value', aggfunc='sum')
s = df1.mask(df1.notna().diff().fillna(False)).sum(axis=1)
print(s)
Date
1    10.0
2    18.0
3    27.0
4    42.0
dtype: float64
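
To see why the mask drops the right cells, print the intermediates. A minimal sketch on the sample df, assuming a pandas version where diff on a boolean frame acts as an element-wise "changed vs. the previous row" check (recent releases do); the output looks roughly like:

presence = df1.notna()                   # True where an ID has a value on that date
changed = presence.diff().fillna(False)  # True where presence flipped vs the previous date
print(changed)
ID         1      2      3      4      5
Date
1      False  False  False  False  False
2       True  False  False  False  False
3      False  False  False  False   True
4       True  False  False  False   True

Masking these cells removes the value 11 (ID 5 is new on Date 3) and the value 12 (ID 1 is new on Date 4) before the row sums are taken.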

A first solution, which I think is slower:

You can build the sets of IDs per date, take element-wise set differences with Series.diff to get the IDs that are new on each date, expand them with Series.explode, recover their values from the original frame with DataFrame.merge, aggregate the sum per date, and subtract it from the plain per-date sums:

tmpL = (df.groupby('Date')['ID'].apply(set)
          .diff()
          .explode()
          .reset_index()
          .merge(df)
          .groupby('Date')['Value']
          .sum())
tmpV = df.groupby('Date')['Value'].sum()

out = tmpV.sub(tmpL, fill_value=0)
print(out)
Date
1    10.0
2    18.0
3    27.0
4    42.0
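
The step doing the real work is Series.diff on a Series of sets: for object dtype, diff subtracts consecutive elements, and subtracting sets is a set difference, so each date maps to the IDs that are new relative to the previous date. On the sample data it gives roughly:

print(df.groupby('Date')['ID'].apply(set).diff())
Date
1      NaN
2    set()
3      {5}
4      {1}
Name: ID, dtype: object

The empty set for Date 2 explodes to NaN, which the merge then drops, so nothing is subtracted for dates with no new IDs.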

CodePudding user response:

Try:

pivot = df.pivot_table(index='Date', columns='ID', values='Value')
# keep a cell only if the ID is present on both this date and the previous one
condition = pivot.notna() & pivot.notna().shift(1)
condition.iloc[0, :] = True  # the first date has no predecessor, so keep it whole
print(pivot[condition].sum(axis=1))

Output:

Date
1    10.0
2    18.0
3    27.0
4    42.0
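
As a sanity check (a minimal sketch, not part of either answer, reusing the df and import from the question), both vectorized answers can be compared against the loop's result:

# expected result, taken from the loop in the question
expected = pd.Series([10, 18, 27, 42],
                     index=pd.Index([1, 2, 3, 4], name='Date'))

# first answer: mask cells whose presence changed, then sum the rows
p1 = df.pivot_table(index='Date', columns='ID', values='Value', aggfunc='sum')
s1 = p1.mask(p1.notna().diff().fillna(False)).sum(axis=1)
assert (s1 == expected).all()

# second answer: keep cells present on both this date and the previous one
p2 = df.pivot_table(index='Date', columns='ID', values='Value')
cond = p2.notna() & p2.notna().shift(1)
cond.iloc[0, :] = True
assert (p2[cond].sum(axis=1) == expected).all()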