Aggregating with pd.NamedAgg with additional conditions


I want to aggregate time-series data based on several conditions. Apart from grouping the data by a timespan and the "types" column, I would like to count and sum only the positive values in the respective groups.
Can this be done elegantly, without using .filter or creating subsets beforehand and merging the aggregated data afterwards?

The following code has been corrected thanks to jezrael's answer below. You should still check out his answer for a more performant solution: while I prefer "the style" of the approach I took here, for my dataframe of ~50k rows his approach is far faster.

Sample data:

import numpy as np
import pandas as pd

datelist = ['2021-01-01', '2021-02-01', '2021-03-01']
datelist = [pd.to_datetime(item) for item in datelist]
datelist = [item for item in datelist for _ in range(5)]  # repeat each date five times
valuelist = np.random.randint(-100, 100, size=15)
typelist = np.random.randint(0, 3, size=15)
df = pd.DataFrame(
    {'values': valuelist,
     'types': typelist
    }, index=datelist)
print(df)
            values  types
2021-01-01     -91      2
2021-01-01     -32      1
2021-01-01     -88      1
2021-01-01       7      1
2021-01-01     -84      0
2021-02-01     -57      0
2021-02-01     -28      1
2021-02-01     -11      0
2021-02-01     -66      1
2021-02-01      -9      2
2021-03-01      55      2
2021-03-01     -10      0
2021-03-01     -89      1
2021-03-01      61      1
2021-03-01     -28      1

myagg = {
        'values_sum' : ('values', 'sum'),
        'values_positive_count' : ('values', lambda x: (x > 0).sum()),  # counts positive values
        'values_negative_count' : ('values', lambda x: (x < 0).sum()),  # counts negative values
        'values_negative_sum' : ('values', lambda x: ((x < 0)*x).sum()),  # sums negative values - corrected the parentheses thanks to jezrael's input; works now
    }
df_agg = df.groupby([pd.Grouper(freq='D'), 'types']).agg(**myagg)

Desired result:

print(df_agg)
                  values_sum  values_positive_count  values_negative_count  values_negative_sum
           types
2021-01-01 0             106                      3                      0                    0
           1             -62                      0                      1                    7
           2              -4                      0                      1                    0
2021-02-01 0             -97                      0                      2                    0
           1              12                      1                      1                    0
           2              58                      1                      0                    0
2021-03-01 0             -35                      1                      1                    0
           1             111                      2                      0                   55
           2              85                      1                      0                   61

CodePudding user response:

A parenthesis is missing after lambda x in your values_negative_sum aggregation, and for grouping by the types column a pd.Grouper is not necessary:

myagg = {
        'values_sum' : ('values', 'sum'),
        'values_positive_count' : ('values', lambda x: (x > 0).sum()), 
        'values_negative_count' : ('values', lambda x: (x < 0).sum()),
        'values_negative_sum' : ('values', lambda x: ((x < 0)*x).sum()), 
}
df_agg = df.groupby([pd.Grouper(freq='D'), 'types']).agg(**myagg)
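Putting the corrected aggregation together into a self-contained sketch (the fixed seed and the default_rng generator are assumptions added for reproducibility; the question uses unseeded np.random.randint):

```python
import numpy as np
import pandas as pd

# Reproducible sample data (seeded generator is an assumption, not from the question)
rng = np.random.default_rng(0)
dates = pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01']).repeat(5)
df = pd.DataFrame({'values': rng.integers(-100, 100, size=15),
                   'types': rng.integers(0, 3, size=15)}, index=dates)

myagg = {
    'values_sum': ('values', 'sum'),
    'values_positive_count': ('values', lambda x: (x > 0).sum()),
    'values_negative_count': ('values', lambda x: (x < 0).sum()),
    'values_negative_sum': ('values', lambda x: ((x < 0) * x).sum()),
}
# Group by day and by the 'types' column; **myagg expands to named aggregations
df_agg = df.groupby([pd.Grouper(freq='D'), 'types']).agg(**myagg)
print(df_agg)
```

The per-group pieces must add back up to the whole frame: the values_sum column totals df['values'].sum(), and the positive/negative counts total the overall counts.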

Or, if you need better performance, create helper columns before the groupby so that you only aggregate with sum, like this:

myagg = {
        'values_sum' : ('values', 'sum'),
        'values_positive_count' : ('p', 'sum'),
        'values_negative_count' : ('n', 'sum'),
        'values_negative_sum' : ('n_sum','sum')
}
df_agg = (df.assign(p = df['values'].gt(0),
                    n = df['values'].lt(0),
                    n_sum = df['values'].lt(0).mul(df['values']),
                   )
            .groupby([pd.Grouper(freq='D'), 'types'])
            .agg(**myagg))
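To check that the helper-column version agrees with the lambda version, both can be run on the same sample frame (the seeded data and the side-by-side comparison are assumptions added for illustration, not part of the answer):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed is an assumption, for reproducibility
dates = pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01']).repeat(5)
df = pd.DataFrame({'values': rng.integers(-100, 100, size=15),
                   'types': rng.integers(0, 3, size=15)}, index=dates)

# Lambda-based aggregation: compact, but each lambda runs once per group in Python
slow = df.groupby([pd.Grouper(freq='D'), 'types']).agg(
    values_sum=('values', 'sum'),
    values_positive_count=('values', lambda x: (x > 0).sum()),
    values_negative_count=('values', lambda x: (x < 0).sum()),
    values_negative_sum=('values', lambda x: ((x < 0) * x).sum()),
)

# Helper-column aggregation: everything reduces to a vectorized 'sum'
fast = (df.assign(p=df['values'].gt(0),
                  n=df['values'].lt(0),
                  n_sum=df['values'].lt(0).mul(df['values']))
          .groupby([pd.Grouper(freq='D'), 'types'])
          .agg(values_sum=('values', 'sum'),
               values_positive_count=('p', 'sum'),
               values_negative_count=('n', 'sum'),
               values_negative_sum=('n_sum', 'sum')))
```

Both frames should be identical up to integer dtype, which is why the helper-column form is a drop-in replacement on larger data.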

CodePudding user response:

This produces the result you desired:

#1. Using 'NamedAgg' (https://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html#named-aggregation)

df = pd.DataFrame(
    {
        'values': valuelist,
        'types': typelist,
        'date': datelist,  # The dates do not need to be the index.
    }
)

df = df.groupby(['date', 'types']).agg(
    sum=pd.NamedAgg(column='values', aggfunc='sum'),
    values_positive_count=pd.NamedAgg(column='values', aggfunc=lambda x: (x > 0).sum()),
    values_negative_count=pd.NamedAgg(column='values', aggfunc=lambda x: (x < 0).sum()),
    a_positive_sum=pd.NamedAgg(column='values', aggfunc=lambda x: ((x > 0) * x).sum()),
)

print(df)

#2. Using groupby and rename

df = pd.DataFrame(
    {
        'values': valuelist,
        'types': typelist,
        'date': datelist,  # The dates do not need to be the index.
    }
)

df = df.groupby(['date', 'types']).agg({
    'values': ['sum', lambda x: (x > 0).sum(), lambda x: (x < 0).sum(), lambda x: ((x > 0) * x).sum()]
})

df = df.rename(columns={
    '<lambda_0>': 'values_positive_count',
    '<lambda_1>': 'values_negative_count',
    '<lambda_2>': 'a_positive_sum',
})
df = df.droplevel(0, axis=1)

print(df)
                  sum  values_positive_count  values_negative_count  a_positive_sum
date       types                                                                   
2021-01-01 0      -69                      0                      1               0
           1       49                      1                      0              49
           2       91                      2                      1             107
2021-02-01 0       -6                      1                      1              93
           1       -1                      1                      1              63
           2       14                      1                      0              14
2021-03-01 0      -67                      0                      2               0
           1       60                      1                      1              90
           2      -93                      0                      1               0
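As a side note, pd.NamedAgg and the plain ('column', aggfunc) tuple are interchangeable in .agg() named aggregation. A minimal sketch with hand-picked values (the tiny frame below is hypothetical, chosen so the expected numbers are easy to verify):

```python
import pandas as pd

df = pd.DataFrame({'values': [7, -3, 5],
                   'types': [0, 0, 1],
                   'date': pd.to_datetime(['2021-01-01'] * 3)})

# Tuple shorthand for named aggregation; equivalent to pd.NamedAgg(...)
out = df.groupby(['date', 'types']).agg(
    values_sum=('values', 'sum'),
    values_positive_sum=('values', lambda x: x[x > 0].sum()),
)
print(out)
```

Group (2021-01-01, 0) holds 7 and -3, so its sum is 4 and its positive sum is 7; group (2021-01-01, 1) holds only 5.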